
/proc/sys/{vm,kernel} documented (proc kernel faq)

Keywords: proc, kernel, faq
RU.LINUX (2:5077/15.22)
From : Boris Tobotras, 2:5020/400, 26 Feb 98 14:56:18
Subj : /proc/sys/{vm,kernel} documented

From: Boris Tobotras <>

Documentation for /proc/sys/*/* version 0.1
(c) 1998, Rik van Riel <>

'Why', I hear you ask, 'would anyone even _want_ documentation for them sysctl files? If anybody really needs it, it's all in the source...'

Well, this documentation is written because some people either don't know they need to tweak something, or because they don't have the time or knowledge to read the source code.

Furthermore, the programmers who built sysctl have built it to be actually used, not just for the fun of programming it :-)
Legal blurb:

As usual, there are two main things to consider:
1. you get what you pay for
2. it's free

The consequences are that I won't guarantee the correctness of this document, and if you come to me complaining about how you screwed up your system because of wrong documentation, I won't feel sorry for you. I might even laugh at you...

But of course, if you _do_ manage to screw up your system using only the sysctl options used in this file, I'd like to hear of it. Not only to have a great laugh, but also to make sure that you're the last RTFMing person to screw up.

In short, e-mail your suggestions, corrections and / or horror stories to: <>

Rik van Riel.
Introduction:

Sysctl is a means of configuring certain aspects of the kernel at run-time, and the /proc/sys/ directory is there so that you don't even need special tools to do it! In fact, there are only four things needed to use these config facilities:
- a running Linux system
- root access
- common sense (this is especially hard to come by these days)
- knowledge of what all those values mean

As a quick 'ls /proc/sys' will show, the directory consists of several (arch-dependent?) subdirs. Each subdir is mainly about one part of the kernel, so you can do configuration on a piece by piece basis, or just some 'thematic frobbing'.

The subdirs are about:
    debug/   <empty>
    fs/      specific filesystems
             binfmt_misc <linux/Documentation/binfmt_misc.txt>
    kernel/  global kernel info / tuning
             open file / inode tuning
             miscellaneous stuff
    net/     networking stuff, for documentation look in:
             <linux/Documentation/networking/>
    proc/    <empty>
    vm/      memory management tuning
             buffer and cache management

These are the subdirs I have on my system. There might be more or other subdirs in another setup. If you see another dir, I'd really like to hear about it :-)

Documentation for /proc/sys/kernel/* version 0.1
(c) 1998, Rik van Riel <>

For general info and legal blurb, please look in README.
This file contains documentation for the sysctl files in /proc/sys/kernel/ and is valid for Linux kernel version 2.1.

The files in this directory can be used to tune and monitor miscellaneous and general things in the operation of the Linux kernel. Since some of the files _can_ be used to screw up your system, it is advisable to read both documentation and source before actually making adjustments.

Currently, these files are in /proc/sys/kernel:
- ctrl-alt-del
- dentry-state
- domainname
- file-max
- file-nr
- hostname
- inode-max
- inode-nr
- inode-state
- osrelease
- ostype
- panic
- printk
- securelevel
- version
ctrl-alt-del:

When the value in this file is 0, ctrl-alt-del is trapped and sent to the init(1) program to handle a graceful restart. When, however, the value is > 0, Linux's reaction to a Vulcan Nerve Pinch (tm) will be an immediate reboot, without even syncing its dirty buffers.

Note: when a program (like dosemu) has the keyboard in 'raw' mode, the ctrl-alt-del is intercepted by the program before it ever reaches the kernel tty layer, and it's up to the program to decide what to do with it.
dentry-state:

From linux/fs/dentry.c:
--------------------------------------------------------------
struct {
        int nr_dentry;
        int nr_unused;
        int age_limit;         /* age in seconds */
        int want_pages;        /* pages requested by system */
        int dummy[2];
} dentry_stat = {0, 0, 45, 0,};
--------------------------------------------------------------

Dentries are dynamically allocated and deallocated, and nr_dentry seems to be 0 all the time. Hence it's safe to assume that only nr_unused, age_limit and want_pages are used. Nr_unused seems to be exactly what its name says. Age_limit is the age in seconds after which dcache entries can be reclaimed when memory is short, and want_pages is nonzero when shrink_dcache_pages() has been called and the dcache isn't pruned yet.
domainname & hostname:

These files can be used to set the domainname and hostname of your box. For the classic 'darkstar' box, a simple:
    # echo "darkstar" > /proc/sys/kernel/hostname
    # echo "" > /proc/sys/kernel/domainname
would suffice to set your hostname and domainname.
file-max & file-nr:

The kernel allocates file handles dynamically, but as yet it doesn't free them again...

The value in file-max denotes the maximum number of file handles that the Linux kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit.

The three values in file-nr denote the number of allocated file handles, the number of used file handles and the maximum number of file handles. When the number of allocated file handles comes close to the maximum, but the number of actually used ones is far behind, you've encountered a peak in your file handle usage and you don't need to increase the maximum.
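The rule of thumb above can be sketched in C. This is only an illustration, not kernel code: the struct name file_nr, both helper names, and the 90% / 50% thresholds are assumptions made up for this example. Only the order of the three file-nr fields comes from the text.

```c
#include <assert.h>
#include <stdio.h>

/* The three values of /proc/sys/kernel/file-nr, in the order given above.
 * The struct itself is hypothetical, it is not a kernel type. */
struct file_nr {
    long allocated;   /* file handles the kernel has allocated */
    long used;        /* file handles actually in use */
    long max;         /* maximum number of file handles (file-max) */
};

/* Parse one line such as "4000 300 4096".
 * Returns 0 on success, -1 if the line does not hold three numbers. */
int parse_file_nr(const char *line, struct file_nr *out)
{
    assert(line != NULL && out != NULL);
    if (sscanf(line, "%ld %ld %ld",
               &out->allocated, &out->used, &out->max) != 3)
        return -1;
    return 0;
}

/* Returns 1 when raising file-max looks worthwhile: allocation is close
 * to the maximum AND most allocated handles are really in use.
 * The 90% and 50% cut-offs are assumed values, tune them to taste. */
int should_raise_file_max(const struct file_nr *f)
{
    int near_limit, really_used;

    assert(f != NULL);
    if (f->max <= 0)
        return 0;
    near_limit  = f->allocated * 10 >= f->max * 9;   /* >= 90% of max */
    really_used = f->used * 2 >= f->allocated;       /* >= 50% in use */
    return near_limit && really_used;
}
```

Feeding it the contents of file-nr after a burst of activity would report a mere peak (allocated high, used low) as 'no need to raise', which is the case the text warns about.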
inode-max, inode-nr & inode-state:

As with file handles, the kernel allocates the inode structures dynamically, but can't free them yet...

The value in inode-max denotes the maximum number of inode handlers. This value should be 3-4 times larger than the value in file-max, since stdin, stdout and network sockets also need an inode struct to handle them. When you regularly run out of inodes, you need to increase this value.

The file inode-nr contains the first two items from inode-state, so we'll skip to that file...

Inode-state contains three actual numbers and four dummies. The actual numbers are, in order of appearance, nr_inodes, nr_free_inodes and preshrink.

Nr_inodes stands for the number of inodes the system has allocated; this can be slightly more than inode-max because Linux allocates them one pageful at a time.

Nr_free_inodes represents the number of free inodes (?) and preshrink is nonzero when nr_inodes > inode-max and the system needs to prune the inode list instead of allocating more.
osrelease, ostype & version:

    # cat osrelease
    2.1.88
    # cat ostype
    Linux
    # cat version
    #5 Wed Feb 25 21:49:24 MET 1998

The files osrelease and ostype should be clear enough. Version needs a little more clarification however. The '#5' means that this is the fifth kernel built from this source base and the date behind it indicates the time the kernel was built. The only way to tune these values is to rebuild the kernel :-)
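For scripts that want the two halves of the version string separately, a small C sketch follows. Both function names are hypothetical, and the code assumes the string always has the '#<count> <date>' shape shown in the example above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns the build counter from a string like
 * "#5 Wed Feb 25 21:49:24 MET 1998", or -1 when the string does not
 * start with '#' followed by a number. (Hypothetical helper.) */
int kernel_build_number(const char *version)
{
    int n;

    assert(version != NULL);
    if (sscanf(version, "#%d", &n) != 1)
        return -1;
    return n;
}

/* Returns a pointer to the build date, i.e. everything after the first
 * space, or NULL when there is no space. The pointer refers into the
 * caller's string, nothing is copied. (Hypothetical helper.) */
const char *kernel_build_date(const char *version)
{
    const char *space;

    assert(version != NULL);
    space = strchr(version, ' ');
    return space ? space + 1 : NULL;
}
```

With the example string from this section, the first helper yields the 5 of '#5' and the second yields the date part.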
panic:

The value in this file represents the number of seconds the kernel waits before rebooting on a panic. When you use the software watchdog, the recommended setting is 60.
printk:

The four values in printk denote: console_loglevel, default_message_loglevel, minimum_console_loglevel and default_console_loglevel respectively.

These values influence printk() behaviour when printing / logging error messages. See 'man 2 syslog' for more info on the different loglevels.

- console_loglevel: messages with a higher priority than this will be printed to the console
- default_message_loglevel: messages without an explicit priority will be printed with this priority
- minimum_console_loglevel: minimum (highest) value to which console_loglevel can be set
- default_console_loglevel: default value for console_loglevel

Note: a quick look in linux/kernel/printk.c will reveal that these variables aren't put inside a structure, so their order in-core isn't formally guaranteed and garbage values _might_ occur when the compiler changes. (???)
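How the four values interact can be sketched as below. This is a user-space model, not the code in printk.c: the struct and function names are hypothetical. It relies on the syslog convention that a numerically lower level means a higher priority, and it assumes that -1 is a convenient stand-in for 'no explicit priority'.

```c
#include <assert.h>
#include <stdio.h>

/* The four values of /proc/sys/kernel/printk, in order.
 * Hypothetical struct, the kernel keeps these as separate variables. */
struct printk_levels {
    int console_loglevel;
    int default_message_loglevel;
    int minimum_console_loglevel;
    int default_console_loglevel;
};

/* Parse a line holding four integers. Returns 0 on success, -1 on error. */
int parse_printk(const char *line, struct printk_levels *out)
{
    assert(line != NULL && out != NULL);
    if (sscanf(line, "%d %d %d %d",
               &out->console_loglevel, &out->default_message_loglevel,
               &out->minimum_console_loglevel,
               &out->default_console_loglevel) != 4)
        return -1;
    return 0;
}

/* Returns 1 when a message of the given level would reach the console.
 * Pass -1 (an assumption of this sketch) for a message without an
 * explicit priority; it then gets default_message_loglevel. */
int reaches_console(const struct printk_levels *p, int msg_level)
{
    assert(p != NULL);
    if (msg_level < 0)
        msg_level = p->default_message_loglevel;
    /* lower number == higher priority */
    return msg_level < p->console_loglevel;
}

/* console_loglevel can not be set below minimum_console_loglevel. */
int clamp_console_loglevel(const struct printk_levels *p, int requested)
{
    assert(p != NULL);
    return requested < p->minimum_console_loglevel
               ? p->minimum_console_loglevel
               : requested;
}
```

The sample line '7 4 1 7' used when trying this out is an arbitrary illustration, not necessarily the default on any given kernel.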
securelevel:

When the value in this file is nonzero, root is prohibited from:
- changing the immutable and append-only flags on files
- changing sysctl things (limited ???)
real-root-dev: (CONFIG_INITRD only)

This file is used to configure the real root device when using an initial ramdisk to configure the system before switching to the 'real' root device. See linux/Documentation/initrd.txt for more info.
reboot-cmd: (Sparc only)

??? This seems to be a way to give an argument to the Sparc ROM/Flash boot loader. Maybe to tell it what to do after rebooting. ???

Documentation for /proc/sys/vm/* version 0.1
(c) 1998, Rik van Riel <>

For general info and legal blurb, please look in README.
This file contains the documentation for the sysctl files in /proc/sys/vm and is valid for Linux kernel version 2.1.

The files in this directory can be used to tune the operation of the virtual memory (VM) subsystem of the Linux kernel, and one of the files (bdflush) also has a little influence on disk usage.

Currently, these files are in /proc/sys/vm:
- bdflush
- freepages
- overcommit_memory
- swapctl
- swapout_interval
bdflush:

This file controls the operation of the bdflush kernel daemon. The source code to this struct can be found in linux/fs/buffer.c. It currently contains 9 integer values, of which 6 are actually used by the kernel.

From linux/fs/buffer.c:
--------------------------------------------------------------
union bdflush_param{
        struct {
                int nfract;     /* Percentage of buffer cache dirty to
                                   activate bdflush */
                int ndirty;     /* Maximum number of dirty blocks to
                                   write out per wake-cycle */
                int nrefill;    /* Number of clean buffers to try to
                                   obtain each time we call refill */
                int nref_dirt;  /* Dirty buffer threshold for activating
                                   bdflush when trying to refill
                                   buffers. */
                int dummy1;     /* unused */
                int age_buffer; /* Time for normal buffer to age before
                                   we flush it */
                int age_super;  /* Time for superblock to age before we
                                   flush it */
                int dummy2;     /* unused */
                int dummy3;     /* unused */
        } b_un;
        unsigned int data[N_PARAM];
} bdf_prm = {{40, 500, 64, 256, 15, 30*HZ, 5*HZ, 1884, 2}};
--------------------------------------------------------------

The first parameter (nfract) governs the maximum percentage of dirty buffers in the buffer cache. Dirty means that the contents of the buffer still have to be written to disk (as opposed to a clean buffer, which can just be forgotten about). Setting this to a high value means that Linux can delay disk writes for a long time, but it also means that it will have to do a lot of I/O at once when memory becomes short. A low value will spread out disk I/O more evenly.

The second parameter (ndirty) gives the maximum number of dirty buffers that bdflush can write to the disk in one go. A high value will mean delayed, bursty I/O, while a small value can lead to memory shortage when bdflush isn't woken up often enough...

The third parameter (nrefill) is the number of buffers that bdflush will add to the list of free buffers when refill_freelist() is called. It is necessary to allocate free buffers beforehand, since the buffers often are of a different size than memory pages and some bookkeeping needs to be done beforehand. The higher the number, the more memory will be wasted and the less often refill_freelist() will need to run.

When refill_freelist() comes across more than nref_dirt dirty buffers, it will wake up bdflush.

Finally, the age_buffer and age_super parameters govern the maximum time Linux waits before writing out a dirty buffer to disk. The value is expressed in jiffies (clockticks); the number of jiffies per second is 100, except on Alpha machines (1024). Age_buffer is the maximum age for data blocks, while age_super is for filesystem metadata.
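Since age_buffer and age_super are given in jiffies, a conversion helper is handy when deciding what to write into bdflush. A minimal sketch follows; the function names are hypothetical, and the tick rate is passed in as a parameter because it differs per architecture (100, or 1024 on Alpha, as stated above).

```c
#include <assert.h>

/* Convert a jiffies count to seconds for a given tick rate (HZ).
 * Hypothetical helper, not a kernel function. */
double jiffies_to_seconds(unsigned long jiffies, unsigned int hz)
{
    assert(hz > 0);
    return (double)jiffies / hz;
}

/* Convert whole seconds to jiffies for a given tick rate (HZ). */
unsigned long seconds_to_jiffies(unsigned int seconds, unsigned int hz)
{
    assert(hz > 0);
    return (unsigned long)seconds * hz;
}
```

With HZ at 100, the defaults 30*HZ and 5*HZ from the initializer above are 3000 and 500 jiffies, so data blocks may stay dirty for 30 seconds and metadata for 5. To keep the same 30 seconds on an Alpha, the value to write would be 30 * 1024 = 30720.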
freepages:

This file contains three values: min_free_pages, free_pages_low and free_pages_high, in that order.

These numbers are used by the VM subsystem to keep a reasonable number of pages on the free page list, so that programs can allocate new pages without having to wait for the system to free used pages first. The actual freeing of pages is done by kswapd, a kernel daemon.

min_free_pages  -- when the number of free pages reaches this level, only the kernel can allocate memory, and only for _critical_ tasks
free_pages_low  -- when the number of free pages drops below this level, kswapd is woken up immediately
free_pages_high -- this is kswapd's target; when more than free_pages_high pages are free, kswapd will stop swapping

When the number of free pages is between free_pages_low and free_pages_high, and kswapd hasn't run for swapout_interval jiffies, then kswapd is woken up too. See swapout_interval for more info.

When free memory is always low on your system, and kswapd has trouble keeping up with allocations, you might want to increase these values, especially free_pages_high and perhaps free_pages_low. I've found that a 1:2:4 relation for these values tends to work rather well in a heavily loaded system.
overcommit_memory:

This file contains only one value. The following algorithm is used to decide if there's enough memory. If the value of overcommit_memory > 0, then there's always enough memory :-). This is a useful feature, since programs often malloc() huge amounts of memory 'just in case', while they only use a small part of it. Leaving this value at 0 will lead to the failure of such a huge malloc(), when in fact the system has enough memory for the program to run...

On the other hand, enabling this feature can cause you to run out of memory and thrash the system to death, so large and/or important servers will want to set this value to 0.

From linux/mm/mmap.c:
--------------------------------------------------------------
static inline int vm_enough_memory(long pages)
{
        /* Stupid algorithm to decide if we have enough memory: while
         * simple, it hopefully works in most obvious cases.. Easy to
         * fool it, but this should catch most mistakes.
         */
        long freepages;

        /* Sometimes we want to use more memory than we have. */
        if (sysctl_overcommit_memory)
                return 1;

        freepages = buffermem >> PAGE_SHIFT;
        freepages += page_cache_size;
        freepages >>= 1;
        freepages += nr_free_pages;
        freepages += nr_swap_pages;
        freepages -= num_physpages >> 4;
        return freepages > pages;
}
--------------------------------------------------------------
swapctl:

This file contains no less than 16 variables, of which about half are actually used :-) In the listing below, the unused variables are marked as such. All of these values are used by kswapd, and the usage can be found in linux/mm/vmscan.c.

From linux/include/linux/swapctl.h:
--------------------------------------------------------------
typedef struct swap_control_v5
{
        unsigned int    sc_max_page_age;
        unsigned int    sc_page_advance;
        unsigned int    sc_page_decline;
        unsigned int    sc_page_initial_age;
        unsigned int    sc_max_buff_age;        /* unused */
        unsigned int    sc_buff_advance;        /* unused */
        unsigned int    sc_buff_decline;        /* unused */
        unsigned int    sc_buff_initial_age;    /* unused */
        unsigned int    sc_age_cluster_fract;
        unsigned int    sc_age_cluster_min;
        unsigned int    sc_pageout_weight;
        unsigned int    sc_bufferout_weight;
        unsigned int    sc_buffer_grace;        /* unused */
        unsigned int    sc_nr_buffs_to_free;    /* unused */
        unsigned int    sc_nr_pages_to_free;    /* unused */
        enum RCL_POLICY sc_policy;              /* RCL_PERSIST hardcoded */
} swap_control_v5;
--------------------------------------------------------------

The first four variables are used to keep track of Linux's page aging. Page aging is a bookkeeping method to keep track of which pages of memory are used often, and which pages can be swapped out without consequences.

When a page is swapped in, it starts at sc_page_initial_age (default 3) and when the page is scanned by kswapd, its age is adjusted according to the following scheme:
- if the page was used since the last time we scanned, its age is increased by sc_page_advance (default 3) up to a maximum of sc_max_page_age (default 20)
- else (it wasn't used) its age is decreased by sc_page_decline (default 1)
And when a page reaches age 0, it's ready to be swapped out.

The variables sc_age_cluster_fract till sc_bufferout_weight have to do with the amount of scanning kswapd is doing on each call to try_to_swap_out().

sc_age_cluster_fract is used to calculate how many pages from a process are to be scanned by kswapd. The formula used is sc_age_cluster_fract/1024 * RSS, so if you want kswapd to scan the whole process, sc_age_cluster_fract needs to have a value of 1024. The minimum number of pages kswapd will scan is represented by sc_age_cluster_min; this is done so kswapd will also scan small processes.

The values of sc_pageout_weight and sc_bufferout_weight are used to control how many tries kswapd will make in order to swap out one page / buffer. As with sc_age_cluster_fract, the actual value is calculated by several more or less complex formulae and the default value is good for every purpose.
swapout_interval:

The single value in this file controls the amount of time between successive wakeups of kswapd when nr_free_pages is between free_pages_low and free_pages_high. The default value of HZ/4 is usually right, but when kswapd can't keep up with the number of allocations in your system, you might want to decrease this number.

--
Best regards, -- Boris.

--- ifmail v.2.14dev2
 * Origin: Jet Infosystems (2:5020/400@fidonet)
