FreeBSD/Linux Kernel Cross Reference
Name | Size | Last modified (GMT)
Parent directory | - | 2013-10-07 20:27:23
README | 15209 bytes | 2009-02-08 18:53:02
README.softupdates | 888 bytes | 2009-02-08 18:53:02
dinode.h | 5748 bytes | 2010-02-10 22:25:49
dir.h | 6209 bytes | 2013-10-07 20:27:23
dirhash.h | 5282 bytes | 2009-02-08 18:53:02
ffs_alloc.c | 56160 bytes | 2013-10-07 20:27:23
ffs_balloc.c | 14933 bytes | 2013-10-07 20:27:23
ffs_extern.h | 5381 bytes | 2010-02-10 22:25:49
ffs_inode.c | 17091 bytes | 2013-12-06 09:31:48
ffs_rawread.c | 10139 bytes | 2013-10-07 20:27:23
ffs_softdep.c | 151254 bytes | 2013-12-06 09:31:48
ffs_softdep_stub.c | 5363 bytes | 2009-02-08 18:53:02
ffs_subr.c | 7749 bytes | 2013-12-06 09:31:48
ffs_tables.c | 5786 bytes | 2013-10-07 20:27:23
ffs_vfsops.c | 34395 bytes | 2013-12-06 09:31:48
ffs_vnops.c | 3881 bytes | 2013-10-07 20:27:23
fs.h | 25042 bytes | 2010-12-20 21:05:48
inode.h | 6748 bytes | 2010-02-10 22:25:49
quota.h | 7759 bytes | 2009-02-08 18:53:02
softdep.h | 27747 bytes | 2009-02-08 18:53:02
ufs_bmap.c | 9929 bytes | 2013-10-07 20:27:24
ufs_dirhash.c | 27800 bytes | 2013-10-07 20:27:24
ufs_extern.h | 4768 bytes | 2013-10-07 20:27:24
ufs_ihash.c | 5153 bytes | 2013-10-07 20:27:24
ufs_inode.c | 4936 bytes | 2013-12-06 09:31:48
ufs_lookup.c | 34619 bytes | 2013-12-06 09:31:48
ufs_quota.c | 25238 bytes | 2013-10-07 20:27:24
ufs_readwrite.c | 11744 bytes | 2013-10-07 20:27:24
ufs_types.h | 1993 bytes | 2009-02-08 18:53:02
ufs_vfsops.c | 5818 bytes | 2013-10-07 20:27:24
ufs_vnops.c | 58484 bytes | 2013-12-06 09:31:48
ufsmount.h | 4305 bytes | 2013-10-07 20:27:24

# $FreeBSD: src/sys/ufs/ffs/README,v 1.4 1999/12/03 00:34:26 billf Exp $
# $DragonFly: src/sys/vfs/ufs/README,v 1.4 2004/07/18 19:43:48 drhodus Exp $

Introduction

This package constitutes the alpha distribution of the soft update
code updates for the fast filesystem.

For more information on what Soft Updates is, see:
http://www.ece.cmu.edu/~ganger/papers/CSE-TR-254-95/

Status

My `filesystem torture tests' (described below) run for days without
a hitch (no panics, hangs, filesystem corruption, or memory leaks).
However, I have had several panics reported to me by folks that
are field testing the code which I have not yet been able to
reproduce or fix. Although these panics are rare and do not cause
filesystem corruption, the code should only be put into production
on systems where the system administrator is aware that it is being
run, and knows how to turn it off if problems arise. Thus, you may
hand out this code to others, but please ensure that this status
message is included with any distributions. Please also include
the file ffs_softdep.stub.c in any distributions so that folks that
cannot abide by the need to redistribute source will not be left
with a kernel that will not link. It will resolve all the calls
into the soft update code and simply ignore the request to enable
them. Thus you will be able to ensure that your other hooks have
not broken anything and that your kernel is softdep-ready for those
that wish to use them. Please report problems back to me with
kernel backtraces of panics if possible. This is massively complex
code, and people only have to have their filesystems hosed once or
twice to avoid future changes like the plague. I want to find and
fix as many bugs as soon as possible so as to get the code rock
solid before it gets widely released. Please report any bugs that
you uncover to mckusick@mckusick.com.

Performance

Running the Andrew Benchmarks yields the following raw data:

Phase   Normal  Softdep What it does
  1       3s     <1s    Creating directories
  2       8s      4s    Copying files
  3       6s      6s    Recursive directory stats
  4       8s      9s    Scanning each file
  5      25s     25s    Compilation

Normal:  19.9u 29.2s 0:52.8 135+630io
Softdep: 20.3u 28.5s 0:47.8 103+363io

Another interesting data point is my `filesystem torture tests'.
They consist of 1000 runs of the Andrew benchmarks, 1000 copies and
removes of /etc (with randomly selected pauses of 0-60 seconds
between each copy and remove), and 500 finds from / (with randomly
selected pauses of 100 seconds between each run). The run of the
torture test compares as follows:

With soft updates: writes: 6 sync, 1,113,686 async; run time 19hr, 50min
Normal filesystem: writes: 1,459,147 sync, 487,031 async; run time 27hr, 15min

The upshot is 42% less I/O and 28% shorter running time.

Another interesting test point is a full MAKEDEV. Because it runs
as a shell script, it becomes mostly limited by the execution speed
of the machine on which it runs. Here are the numbers:

With soft updates:

labrat# time ./MAKEDEV std
2.2u 32.6s 0:34.82 100.0% 0+0k 11+36io 0pf+0w

labrat# ls | wc
     522     522    3317

Without soft updates:

labrat# time ./MAKEDEV std
2.0u 40.5s 0:42.53 100.0% 0+0k 11+1221io 0pf+0w

labrat# ls | wc
     522     522    3317

Of course, some of the system time is being pushed
to the syncer process, but that is a different story.

To show a benchmark designed to highlight the soft update code,
consider a tar of zero-sized files and an rm -rf of a directory tree
that has at least 50 files or so at each level.
Running a test with a directory tree containing 28 directories
holding 202 empty files produces the following numbers:

With soft updates:
tar: 0.0u 0.5s 0:00.65 76.9% 0+0k 0+44io 0pf+0w (0 sync, 33 async writes)
rm:  0.0u 0.2s 0:00.20 100.0% 0+0k 0+37io 0pf+0w (0 sync, 72 async writes)

Normal filesystem:
tar: 0.0u 1.1s 0:07.27 16.5% 0+0k 60+586io 0pf+0w (523 sync, 0 async writes)
rm:  0.0u 0.5s 0:01.84 29.3% 0+0k 0+318io 0pf+0w (258 sync, 65 async writes)

The large reduction in writes is because inodes are clustered, so
most of a block gets allocated, then the whole block is written
out once rather than having the same block written once for each
inode allocated from it. Similarly, each directory block is written
once rather than once for each new directory entry. Effectively,
what the soft update code is doing is allocating a bunch of inodes
and directory entries without writing anything, then ensuring that
the block containing the inodes is written first, followed by the
directory block that references them. If there were data in the
files, it would further ensure that the data blocks were written
before their inodes claimed them.

Copyright Restrictions

Please familiarize yourself with the copyright restrictions
contained at the top of either the sys/ufs/ffs/softdep.h or
sys/ufs/ffs/ffs_softdep.c file. The key provision is similar
to the one used by the DB 2.0 package and goes as follows:

    Redistributions in any form must be accompanied by information
    on how to obtain complete source code for any accompanying
    software that uses this software. This source code must
    either be included in the distribution or be available for
    no more than the cost of distribution plus a nominal fee,
    and must be freely redistributable under reasonable
    conditions.
    For an executable file, complete source code
    means the source code for all modules it contains. It does
    not mean source code for modules or files that typically
    accompany the operating system on which the executable file
    runs, e.g., standard library modules or system header files.

The idea is to allow those of you freely redistributing your source
to use it while retaining for myself the right to peddle it for
money to the commercial UNIX vendors. Note that I have included a
stub file ffs_softdep.c.stub that is freely redistributable so that
you can put in all the necessary hooks to run the full soft updates
code, but still allow vendors that want to maintain proprietary
source to have a working system. I do plan to release the code with
a `Berkeley style' copyright once I have peddled it around to the
commercial vendors. If you have concerns about this copyright,
feel free to contact me with them and we can try to resolve any
difficulties.

Soft Dependency Operation

The soft update implementation does NOT require ANY changes
to the on-disk format of your filesystems. Furthermore, it is
not used by default for any filesystems. It must be enabled on
a filesystem-by-filesystem basis by running tunefs to set a
bit in the superblock indicating that the filesystem should be
managed using soft updates. If you wish to stop using
soft updates for performance or reliability reasons,
you can simply run tunefs on the filesystem again to turn off the
bit and revert to normal operation. The additional dynamic memory
load placed on the kernel malloc arena is approximately equal to
the amount of memory used by vnodes plus inodes (for a system
with 1000 vnodes, the additional peak memory load is about 300K).

Kernel Changes

There are two changes to the kernel functionality that are not
contained in the soft update files.
The first is a `trickle sync' facility running in the kernel as
process 3. This trickle sync process replaces the traditional
`update' program (which should be commented out of the /etc/rc
startup script). When a vnode is first written, it is placed
30 seconds down on the trickle sync queue. If it still exists and
has dirty data when it reaches the top of the queue, it is sync'ed.
This approach evens out the load on the underlying I/O system and
avoids writing short-lived files.
The papers on trickle-sync tend to favor aging based on buffers
rather than files. However, I sync on file age rather than buffer
age because the data structures are much smaller, as there are
typically far fewer files than buffers. Although this can make the
I/O spiky when a big file times out, it is still much better than
the wholesale syncs that were happening before. It also adapts
much better to the soft update code, where I want to control
aging to improve performance (inodes age in 10 seconds, directories
in 15 seconds, files in 30 seconds). This ensures that most
dependencies are gone (e.g., inodes are written when directory
entries want to go to disk), reducing the amount of rollback that
is needed.

The other main kernel change is to split the vnode freelist into
two separate lists: one for vnodes that are still being used to
identify buffers and the other for those vnodes no longer identifying
any buffers. The latter list is used by getnewvnode in preference
to the former.

Packaging of Kernel Changes

The sys subdirectory contains the changes and additions to the
kernel. My goal in writing this code was to minimize the changes
that need to be made to the kernel. Thus, most of the new code
is contained in the two new files softdep.h and ffs_softdep.c.
The rest of the kernel changes are simply inserting hooks to
call into these two new files.
Although there has been some
structural reorganization of the filesystem code to accommodate
gathering the information required by the soft update code,
the actual ordering of filesystem operations when soft updates
are disabled is unchanged.

The kernel changes are packaged as a set of diffs. As I am
doing my development in BSD/OS, the diffs are relative to the
BSD/OS versions of the files. Because BSD/OS recently had
4.4BSD-Lite2 merged into it, the Lite2 files are a good starting
point for figuring out the changes. There are 40 files that
require change plus the two new files. Most of these files have
only a few lines of changes in them. However, four files have
fairly extensive changes: kern/vfs_subr.c, vfs/ufs/ufs_lookup.c,
vfs/ufs/ufs_vnops.c, and vfs/ffs/ffs_alloc.c. For these four
files, I have provided the original Lite2 version, the Lite2
version with the diffs merged in, and the diffs between the
BSD/OS and merged version. Even so, I expect that there will
be some difficulty in doing the merge; I am certainly willing
to assist in helping get the code merged into your system.

Packaging of Utility Changes

The utilities subdirectory contains the changes and additions
to the utilities. There are diffs to three utilities enclosed:

    tunefs - add a flag to enable and disable soft updates

    mount - print out whether soft updates are enabled and
            also statistics on number of sync and async writes

    fsck - tighter checks on acceptable errors and a slightly
           different policy for what to put in lost+found on
           filesystems using soft updates

In addition, you should recompile vmstat so as to get reports
on the 13 new memory types used by the soft update code.
It is not necessary to use the new version of fsck; however, it
would aid in my debugging if you do.
Also, because of the time
lag between deleting a directory entry and the inode it
references, you will find a lot more files showing up in your
lost+found if you do not use the new version. Note that the
new version checks for the soft update flag in the superblock
and only uses the new algorithms if it is set. So, it will run
unchanged on the filesystems that are not using soft updates.

Operation

Once you have booted a kernel that incorporates the soft update
code and installed the updated utilities, do the following:

1) Comment out the update program in /etc/rc.

2) Run `tunefs -n enable' on one or more test filesystems.

3) Mount these filesystems and then type `mount' to ensure that
   they have been enabled for soft updates.

4) Copy the test directory to a softdep filesystem, chdir into
   it and run `./doit'. You may want to check out each of the
   three subtests individually first: doit1 - Andrew benchmarks,
   doit2 - copy and removal of /etc, doit3 - find from /.

====
Additional notes from Feb 13

When removing huge directories of files, it is possible to get
the incore state arbitrarily far ahead of the disk. Maintaining
all the associated dependency information can exhaust the kernel
malloc arena. To avoid this scenario, I have put some limits on
the soft update code so that it will not be allowed to rampage
through all of the kernel memory. I enclose below the relevant
patches to vnode.h and vfs_subr.c (which allow the soft update
code to speed up the filesystem syncer process). I have also
included the diffs for ffs_softdep.c. I hope to make a pass over
ffs_softdep.c to isolate the differences with my standard version
so that these diffs are less painful to incorporate.

Since I know you like to play with tuning, I have put the relevant
knobs on sysctl debug variables.
The tuning knobs can be viewed
with `sysctl debug' and set with `sysctl -w debug.<name>=value'.
The knobs are as follows:

    debug.max_softdeps - limit on any given resource
    debug.tickdelay - ticks to delay before allocating
    debug.max_limit_hit - number of times tickdelay imposed
    debug.rush_requests - number of rush requests to filesystem syncer

The max_softdeps limit is derived from vnodesdesired which in
turn is sized based on the amount of memory on the machine.
When the limit is hit, a process requesting a resource first
tries to speed up the filesystem syncer process. Such a
request is recorded as a rush_request. After syncdelay / 2
unserviced rush requests (typically 15) are in the filesystem
syncer's queue (i.e., it is more than 15 seconds behind in its
work), the process requesting the memory is put to sleep for
tickdelay ticks. Such a delay is recorded in max_limit_hit.
Following this delay it is granted its memory without further
delay. I have tried the following experiments in which I
delete an MH directory containing 16,703 files:

Run #           1               2               3

max_softdeps    4496            4496            4496
tickdelay       100 == 1 sec    20 == 0.2 sec   2 == 0.02 sec
max_limit_hit   16 == 16 sec    27 == 5.4 sec   203 == 4.1 sec
rush_requests   147             102             93
run time        57 sec          46 sec          45 sec
I/O's           781             859             936

When run with no limits, it completes in 40 seconds. So, the
time spent in delay is directly added to the bottom line.
Shortening the tick delay does cut down the total running time,
but at the expense of generating more total I/O operations
due to the rush orders being sent to the filesystem syncer.
Although the number of rush orders decreases with a shorter
tick delay, there are more requests in each order, hence the
increase in I/O count.
Also, although the I/O count does rise
with a shorter delay, it is still at least an order of magnitude
less than without soft updates. Anyway, you may want to play
around with these values to see what works best and to see if
you can get an insight into how best to tune them. If you get
out-of-memory panics, then you have max_softdeps set too high.
The max_limit_hit and rush_requests counters should be reset to
zero before each run. The minimum legal value for tickdelay is 2
(if you set it below that, the code will use 2).
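
The claim that the time spent in delay is directly added to the
bottom line can be checked against the experiment table with a little
arithmetic. The sketch below is not part of the soft update code. It
assumes a clock rate of 100 ticks per second (inferred from the
table's "100 == 1 sec"), and the names HZ, RUNS, delay_seconds and
summarize are made up for this illustration.

```python
# Sanity check of the tickdelay experiment table.
# Assumption: 100 ticks per second, inferred from "100 == 1 sec".
HZ = 100

# run number -> (tickdelay in ticks, max_limit_hit count, run time in seconds)
# Values are copied from the experiment table in the notes.
RUNS = {
    1: (100, 16, 57),
    2: (20, 27, 46),
    3: (2, 203, 45),
}


def delay_seconds(tickdelay, max_limit_hit, hz=HZ):
    """Total time asleep: one sleep of tickdelay ticks per limit hit."""
    return tickdelay * max_limit_hit / hz


def summarize(runs=RUNS):
    """Return {run: (delay in seconds, run time minus delay in seconds)}."""
    result = {}
    for run, (tickdelay, hits, run_time) in runs.items():
        delay = delay_seconds(tickdelay, hits)
        result[run] = (delay, run_time - delay)
    return result


if __name__ == "__main__":
    for run, (delay, rest) in summarize().items():
        print(f"run {run}: delay {delay:.2f} sec, remainder {rest:.2f} sec")
```

Under that assumption the computed delays are 16, 5.4 and 4.06
seconds, matching the max_limit_hit row, and the run time left after
subtracting the delay is about 41 seconds in every run, close to the
40 seconds measured with no limits.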
This page is part of the FreeBSD/Linux Linux Kernel Cross-Reference, and was automatically generated using a modified version of the LXR engine.