Act I - The Linux Kernel's VFS - exposition
- Is the component in the kernel that handles file-systems, directory and
file access.
- It abstracts common tasks of many file-systems.
- And presents the user with a unified interface, via the file-related
system calls.
Act II - Relations Of The VFS With The Rest Of The System - the plot thickens...
- The VFS interacts with file-systems
- ... which interact with the buffer cache, page-cache and block devices.
- The VFS also interacts with the user...
- ... via system calls
- Finally, the VFS supplies data structures such as the dcache, the inode
cache and the open file tables...
VFS And The System - Static Relations
The following figure shows the static relations of the VFS with the rest of
the system:
VFS And The System - Dynamic Relations
The following figure shows the dynamic relations of the VFS objects with the
rest of the system:
Act III - Internal Components Of The VFS - the naked souls...
The following components comprise the VFS:
- dcache - cache of "dentry" objects, used to translate paths to inodes.
- inode cache - cache of "inode" objects, used to represent files and
directories on the file systems.
- common code - code shared by many file-systems, which was moved into
functions that are now part of the VFS.
About Caches
- Generally, the VFS attempts to keep caches of objects, because:
- Allocating objects (memory) takes time.
- Resolving objects takes time.
- So for each object type the VFS holds:
- A list of used (i.e. may NOT be deleted) objects.
- A list of resolved but un-used (i.e. recently resolved but may be
deleted) objects.
- A list of un-assigned and un-used (i.e. completely free) objects - by
using SLAB caches.
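The three object states above can be sketched in user-space C. This is a toy model, not kernel code: all names (cache_obj, obj_get, obj_put, cache_shrink) are made up for illustration, and a state tag on a small fixed pool stands in for the three lists and for the SLAB allocator.

```c
#include <assert.h>
#include <stddef.h>

/* FREE = unassigned, UNUSED = resolved but reclaimable, USED = referenced. */
enum obj_state { OBJ_FREE, OBJ_UNUSED, OBJ_USED };

struct cache_obj {
    int key;               /* what the object was resolved for */
    int refcount;          /* > 0 means "in use, may not be reclaimed" */
    enum obj_state state;
};

#define POOL_SIZE 4
static struct cache_obj pool[POOL_SIZE];   /* zero-initialized: all OBJ_FREE */

/* Return a referenced object for key, reusing a cached one if possible. */
struct cache_obj *obj_get(int key)
{
    struct cache_obj *free_slot = NULL, *victim = NULL;

    for (size_t i = 0; i < POOL_SIZE; i++) {
        struct cache_obj *o = &pool[i];
        if (o->state != OBJ_FREE && o->key == key) {   /* cache hit */
            o->refcount++;
            o->state = OBJ_USED;
            return o;
        }
        if (o->state == OBJ_FREE && !free_slot)
            free_slot = o;
        if (o->state == OBJ_UNUSED && !victim)
            victim = o;
    }

    /* Miss: prefer a completely free object, else recycle an unused one. */
    struct cache_obj *o = free_slot ? free_slot : victim;
    if (!o)
        return NULL;                   /* every object is in use */
    o->key = key;
    o->refcount = 1;
    o->state = OBJ_USED;
    return o;
}

/* Drop a reference; the object stays resolved but becomes reclaimable. */
void obj_put(struct cache_obj *o)
{
    assert(o && o->refcount > 0);
    if (--o->refcount == 0)
        o->state = OBJ_UNUSED;
}

/* Reclaim every unused object; returns how many were freed. */
int cache_shrink(void)
{
    int freed = 0;
    for (size_t i = 0; i < POOL_SIZE; i++)
        if (pool[i].state == OBJ_UNUSED) {
            pool[i].state = OBJ_FREE;
            freed++;
        }
    return freed;
}
```

A second obj_get() for the same key after an obj_put() returns the same object without re-resolving it, which is the saving the caches are after.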
The Dcache
- Contains a hash table of "dentry" objects, each representing a translation
from a path to an inode (including "negative" dentries - representing
recent lookups to non-existing files).
- The dentries are also connected in a tree structure, representing the
structures of files and directories on the mounted file systems.
- An entry remains in the dcache until its file-system is un-mounted...
- ... or until it is pruned during a cache shrink, which happens once
every 300 seconds, by the swap-daemon (kswapd)...
- ... or when the swap-daemon (kswapd) needs to free space.
- See file fs/dcache.c, function prune_dcache, for the gory details.
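A rough user-space sketch of the hash lookup described above, keyed by (parent, name). The toy_ names, the bucket count and the hash function are assumptions of this example and do not match the kernel's struct dentry or d_lookup; a NULL inode pointer marks a negative dentry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_BUCKETS 16

struct toy_inode { unsigned long ino; };

struct toy_dentry {
    struct toy_dentry *parent;     /* tree link, in the spirit of d_parent */
    const char *name;              /* last path component */
    struct toy_inode *inode;       /* NULL marks a negative dentry */
    struct toy_dentry *hash_next;  /* chain within one hash bucket */
};

static struct toy_dentry *buckets[HASH_BUCKETS];

/* Mix the parent pointer with the name, so "a/x" and "b/x" differ. */
static unsigned toy_hash(const struct toy_dentry *parent, const char *name)
{
    unsigned h = (unsigned)((uintptr_t)parent >> 4);
    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % HASH_BUCKETS;
}

/* Insert a dentry at the head of its bucket chain. */
void toy_d_add(struct toy_dentry *d)
{
    unsigned h = toy_hash(d->parent, d->name);
    d->hash_next = buckets[h];
    buckets[h] = d;
}

/* Find the cached dentry for name under parent, or NULL on a cache miss. */
struct toy_dentry *toy_d_lookup(struct toy_dentry *parent, const char *name)
{
    assert(name != NULL);
    struct toy_dentry *d = buckets[toy_hash(parent, name)];
    for (; d; d = d->hash_next)
        if (d->parent == parent && strcmp(d->name, name) == 0)
            return d;
    return NULL;
}
```

A hit on a negative dentry (inode == NULL) answers "no such file" without asking the file-system again, while a NULL return means the cache simply does not know.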
The Inode Cache
- Contains a hash table of "inode" objects, each representing a
file/directory on a mounted file system.
- Each inode is potentially linked with a dentry...
- ... as well as with the file-system it belongs to.
- Each inode contains a list of pages and (dirty) buffers belonging to the
file/directory this inode represents.
The File Object
- An object representing an open file instance.
- Points to a dentry, which points to an inode that represents the
actual file...
- Contains information such as the "current position", the access mode with
which the file was opened, the uid and gid under which it was opened, etc.
- Contains a pointer to 'file operations', which is set by the underlying
file-system (or device driver - in case we opened a device-special file).
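The fact that the "current position" lives in the file object, not in the inode, is visible from user space. The sketch below uses only standard POSIX calls; the function name check_file_positions and the temporary file name are assumptions of this example. Two separate open() calls on one file get independent positions, while a dup()ed descriptor shares the position of the descriptor it was copied from.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 1 if positions behave as described, 0 if not, -1 on setup failure. */
int check_file_positions(void)
{
    char path[] = "/tmp/vfs_file_XXXXXX";
    int fd1 = mkstemp(path);          /* first open file object */
    if (fd1 < 0)
        return -1;

    int fd2 = open(path, O_RDONLY);   /* second, independent open file object */
    int fd3 = dup(fd1);               /* new descriptor, same object as fd1 */
    int ok = -1;

    if (fd2 >= 0 && fd3 >= 0 && write(fd1, "0123456789", 10) == 10) {
        off_t moved = lseek(fd1, 4, SEEK_SET);
        ok = moved == 4
          && lseek(fd2, 0, SEEK_CUR) == 0    /* fd2 did not move */
          && lseek(fd3, 0, SEEK_CUR) == 4;   /* fd3 follows fd1 */
    }

    if (fd2 >= 0)
        close(fd2);
    if (fd3 >= 0)
        close(fd3);
    close(fd1);
    unlink(path);
    return ok;
}
```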
Act IV - The VFS Sources - there's a birdhouse in your code...
Let's review the more-interesting source files of the VFS, all found under
directory "fs":
- super.c - handling of super-blocks and of file-system types.
- namespace.c - file-system mount and unmount.
- namei.c - file tree manipulation (lookups, inode creation and deletion,
permissions checking...).
- read_write.c - implementation of read, write and lseek system calls for
files.
- dquot.c - handling of disk usage quotas.
- dcache.c - implementation of the dcache...
- inode.c - implementation of the inode object and inode cache.
Act V - Example VFS Operations - three paths to your soul...
- Let us look at a few interesting VFS operations:
- Path to inode translation.
- File open.
- File read.
VFS Operations - Path To Inode Translation
Given a (full or relative) path to a file, find its inode:
Entry function: user_path_walk, in include/linux/fs.h.
Actual lookup function: link_path_walk, in fs/namei.c.
- First, get the dentry to "/" (if it's a full path) or to "." (if it's
a relative path).
- Start scanning the path, and follow via the dcache.
- Handle "." by skipping.
- Handle ".." using dentry->d_parent.
- Handle others via a cache lookup.
- If not found in the cache - ask the underlying filesystem to perform
a real lookup.
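The walk above can be sketched as a small user-space loop. This is a toy, assuming a tiny in-memory tree; toy_dentry, lookup_child and toy_path_walk are made-up names, and symbolic links, mount points and permission checks (which link_path_walk must handle) are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHILDREN 4

struct toy_dentry {
    const char *name;
    struct toy_dentry *parent;                  /* the root points to itself */
    struct toy_dentry *children[MAX_CHILDREN];  /* NULL-terminated if not full */
};

/* "Cache lookup": find the child of dir named by the len bytes at name. */
static struct toy_dentry *lookup_child(struct toy_dentry *dir,
                                       const char *name, size_t len)
{
    for (int i = 0; i < MAX_CHILDREN && dir->children[i]; i++) {
        const char *n = dir->children[i]->name;
        if (strlen(n) == len && memcmp(n, name, len) == 0)
            return dir->children[i];
    }
    return NULL;   /* the kernel would now ask the file-system to look it up */
}

/* Resolve path, starting at root for absolute paths and at cwd otherwise. */
struct toy_dentry *toy_path_walk(struct toy_dentry *root,
                                 struct toy_dentry *cwd, const char *path)
{
    assert(root && cwd && path);
    struct toy_dentry *cur = (*path == '/') ? root : cwd;

    while (*path) {
        while (*path == '/')
            path++;                          /* skip separators */
        size_t len = strcspn(path, "/");     /* length of next component */
        if (len == 0)
            break;                           /* trailing slash or end */

        if (len == 1 && path[0] == '.') {
            /* "." - stay where we are */
        } else if (len == 2 && path[0] == '.' && path[1] == '.') {
            cur = cur->parent;               /* ".." - go up one level */
        } else {
            cur = lookup_child(cur, path, len);
            if (!cur)
                return NULL;                 /* component not found */
        }
        path += len;
    }
    return cur;
}
```

Because the root's parent is the root itself, "../.." from any depth can never climb above "/".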
VFS Operations - File Open
Given a file path, an open mode and a file permissions mask:
Entry function: sys_open, in fs/open.c.
Underlying open function: filp_open, in fs/open.c.
- Allocate a free file descriptor.
- Try to open the file (next slide).
- On success, put the new 'struct file' in the fd table of the process.
On error, free the allocated file descriptor.
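The reserve / open / install-or-release sequence above can be modelled in a few lines of user-space C. Everything here is a toy: the toy_ names, the fixed-size table and the opener callback (standing in for filp_open) are assumptions of this sketch, loosely modelled on the kernel's get_unused_fd, fd_install and put_unused_fd helpers.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_FDS 8

struct toy_file { int unused; };

struct toy_files {
    struct toy_file *fd[TOY_MAX_FDS];        /* installed open files */
    unsigned char reserved[TOY_MAX_FDS];     /* descriptor numbers handed out */
};

/* Stand-in for filp_open(): returns NULL on failure. */
typedef struct toy_file *(*toy_opener)(const char *path);

/* Reserve the lowest free descriptor number. */
static int toy_get_unused_fd(struct toy_files *t)
{
    for (int i = 0; i < TOY_MAX_FDS; i++)
        if (!t->reserved[i]) {
            t->reserved[i] = 1;
            return i;
        }
    return -EMFILE;                          /* table is full */
}

static void toy_put_unused_fd(struct toy_files *t, int fd)
{
    t->reserved[fd] = 0;
}

int toy_sys_open(struct toy_files *t, const char *path, toy_opener opener)
{
    int fd = toy_get_unused_fd(t);           /* 1. allocate a descriptor */
    if (fd < 0)
        return fd;

    struct toy_file *f = opener(path);       /* 2. try to open the file */
    if (!f) {
        toy_put_unused_fd(t, fd);            /* 3b. error: release it */
        return -ENOENT;
    }
    t->fd[fd] = f;                           /* 3a. success: install it */
    return fd;
}

/* Two sample openers so the sketch can be exercised. */
static struct toy_file toy_the_file;
struct toy_file *toy_open_ok(const char *path)   { (void)path; return &toy_the_file; }
struct toy_file *toy_open_fail(const char *path) { (void)path; return NULL; }
```

Reserving the number first and releasing it on failure means a failed open leaves the table exactly as it was.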
Actually Opening The File
Entry function: open_namei, in fs/namei.c.
- If not opening in create mode (no O_CREAT flag given):
lookup the file via the dentry cache (path_lookup -> path_walk ->
link_path_walk).
- Otherwise (it is an open in create mode):
- Lookup the parent directory (path_lookup again). If it does not
exist, or is not a directory - fail the operation.
- Lookup the file in the parent directory. If it does not exist, create
it (see vfs_create later on).
- Handle special cases (flags mismatch, file is a directory, file is
a link...).
- In both cases (open with create or file already exists):
- Perform sanity checks.
- Check permissions.
- Check various mode limitations (e.g. file is read-only and trying to
open for write, trying to open a device file on a file-system mounted
with the 'nodev' option, etc.).
- Handle truncation.
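The branches above have user-visible effects that can be checked with plain POSIX calls. The sketch below does not call open_namei (a user program cannot); it only shows the outcomes. The function name check_open_modes and the temporary directory name are assumptions of this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 0 if every step behaved as expected, the number of the first
 * failing step otherwise, or -1 if the scratch directory could not be made. */
int check_open_modes(void)
{
    char dir[] = "/tmp/vfs_open_XXXXXX";
    char path[64];
    struct stat st;
    int fd, failed = 0;

    if (!mkdtemp(dir))
        return -1;
    snprintf(path, sizeof path, "%s/file", dir);

    /* Step 1: no O_CREAT - a plain lookup, which fails for a missing file. */
    fd = open(path, O_RDONLY);
    if (!(fd == -1 && errno == ENOENT))
        failed = 1;
    if (fd >= 0)
        close(fd);

    /* Step 2: O_CREAT - parent is looked up, then the file is created. */
    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (failed == 0 && (fd < 0 || write(fd, "data", 4) != 4))
        failed = 2;
    if (fd >= 0)
        close(fd);

    /* Step 3: O_CREAT|O_EXCL on an existing file - the "flags mismatch" case. */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (failed == 0 && !(fd == -1 && errno == EEXIST))
        failed = 3;
    if (fd >= 0)
        close(fd);

    /* Step 4: O_TRUNC - truncation leaves an empty file. */
    fd = open(path, O_WRONLY | O_TRUNC);
    if (failed == 0 && (fd < 0 || fstat(fd, &st) != 0 || st.st_size != 0))
        failed = 4;
    if (fd >= 0)
        close(fd);

    unlink(path);
    rmdir(dir);
    return failed;
}
```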
VFS Create
Entry function: vfs_create, in fs/namei.c.
- Check if we may create a new entry in the given directory (mostly
permission checking).
- Check that the underlying inode has a 'create' inode operation.
- Invoke the 'create' inode operation (of the file-system).
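The three steps above amount to "check, check, dispatch". A minimal user-space sketch, assuming a simplified directory inode with a single 'writable' flag in place of the real permission checks; the toy_ and ramfs_like_ names are made up, and the -EACCES returned for a missing 'create' operation is this sketch's choice of error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_inode;

/* Per-file-system operations table; only 'create' is modelled here. */
struct toy_inode_ops {
    int (*create)(struct toy_inode *dir, const char *name, int mode);
};

struct toy_inode {
    int writable;                        /* stands in for permission checks */
    int nr_created;                      /* lets callers observe the effect */
    const struct toy_inode_ops *i_op;
};

/* Step 1: may we create an entry in this directory? */
static int toy_may_create(const struct toy_inode *dir)
{
    return dir->writable ? 0 : -EACCES;
}

int toy_vfs_create(struct toy_inode *dir, const char *name, int mode)
{
    assert(dir && name);

    int err = toy_may_create(dir);
    if (err)
        return err;

    /* Step 2: the file-system must supply a 'create' operation. */
    if (!dir->i_op || !dir->i_op->create)
        return -EACCES;

    /* Step 3: hand over to the file-system. */
    return dir->i_op->create(dir, name, mode);
}

/* A sample file-system 'create' that only counts its invocations. */
int ramfs_like_create(struct toy_inode *dir, const char *name, int mode)
{
    (void)name;
    (void)mode;
    dir->nr_created++;
    return 0;
}

const struct toy_inode_ops ramfs_like_ops = { ramfs_like_create };
```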
VFS Operations - File Read
Entry function: sys_read, in fs/read_write.c:
- Using the file descriptor, get the (already opened) file struct.
- Verify that the file's access mode allows read.
- Check locks.
- Invoke the underlying 'read' file operation (file->f_op->read), as set by
the file-system.
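The read path above can be sketched in user space as a descriptor lookup, a mode check and an indirect call. This is a toy under stated assumptions: the toy_ names, the fixed-size descriptor table and the in-memory 'file' are invented for the example, and the lock check is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_FMODE_READ 1
#define TOY_MAX_FDS    4

struct toy_file;

struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t n);
};

struct toy_file {
    unsigned mode;                     /* access mode given at open time */
    size_t pos;                        /* current position */
    const char *data;                  /* in-memory file contents */
    size_t size;
    const struct toy_file_ops *f_op;   /* set by the "file-system" */
};

static struct toy_file *toy_fd_table[TOY_MAX_FDS];

/* A sample 'read' operation that serves bytes from memory. */
long toy_mem_read(struct toy_file *f, char *buf, size_t n)
{
    size_t left = f->size - f->pos;
    if (n > left)
        n = left;
    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return (long)n;
}

const struct toy_file_ops toy_mem_ops = { toy_mem_read };

long toy_sys_read(int fd, char *buf, size_t n)
{
    assert(buf != NULL);

    /* 1. Descriptor -> already opened file object. */
    if (fd < 0 || fd >= TOY_MAX_FDS || !toy_fd_table[fd])
        return -EBADF;
    struct toy_file *f = toy_fd_table[fd];

    /* 2. The file must have been opened for reading. */
    if (!(f->mode & TOY_FMODE_READ))
        return -EBADF;

    /* 3. (Lock checking is omitted in this sketch.) */

    /* 4. Dispatch to the underlying 'read' operation. */
    if (!f->f_op || !f->f_op->read)
        return -EINVAL;
    return f->f_op->read(f, buf, n);
}
```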
Reading Via The Page Cache
Entry function: do_generic_file_read, in mm/filemap.c:
- Get the address mapping of the file's inode.
- Translate the read position to a page index (each page contains
2^PAGE_CACHE_SHIFT bytes).
- Calculate read-ahead parameters (i.e. reading from a file at position
X normally causes reading several more pages, assuming the application
is likely to request those pages next).
- Make sure we're not reading past EOF.
- For each page in the range the user asked to read from:
- Look for the page in the page cache.
- If it's there, and is up-to-date:
- If we're not in 'non-blocking' mode (i.e. we're allowed to block),
invoke read-ahead.
- Copy data from the page to the user's buffer.
- Otherwise, use the address mapping 'readpage' operation to read the
page, and start over (unless we're in non-blocking mode).
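The position-to-page arithmetic and the per-page copy loop can be sketched as follows. This is a user-space toy: an array of page pointers plays the page cache, every page is assumed present and up-to-date (so read-ahead and 'readpage' are left out), and the 4096-byte page size (TOY_PAGE_SHIFT of 12) is an assumption, since the real value depends on the architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SHIFT  12
#define TOY_PAGE_SIZE   (1UL << TOY_PAGE_SHIFT)
#define TOY_OFFSET_MASK (TOY_PAGE_SIZE - 1)

/*
 * Copy up to count bytes, starting at file position pos, from the pages
 * of a file into buf. Returns the number of bytes copied.
 */
size_t toy_file_read(char *const *pages, size_t file_size,
                     size_t pos, char *buf, size_t count)
{
    size_t done = 0;

    assert(pages && buf);

    /* Make sure we're not reading past EOF. */
    if (pos >= file_size)
        return 0;
    if (count > file_size - pos)
        count = file_size - pos;

    while (done < count) {
        size_t index  = pos >> TOY_PAGE_SHIFT;    /* which page */
        size_t offset = pos & TOY_OFFSET_MASK;    /* where inside it */
        size_t chunk  = TOY_PAGE_SIZE - offset;   /* bytes left in this page */

        if (chunk > count - done)
            chunk = count - done;

        /* "Copy data from the page to the user's buffer." */
        memcpy(buf + done, pages[index] + offset, chunk);
        pos  += chunk;
        done += chunk;
    }
    return done;
}
```

A read of 20 bytes at position 4090 therefore takes 6 bytes from page 0 and 14 bytes from page 1.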
References
- Linux Kernel 2.4 Internals - Virtual Filesystem (VFS)
- The Linux Virtual File-System Layer
- Linux Virtual File System (lecture slides from 1998).
- A Small Trail Through The Linux Kernel (code walk-through for open and read system calls).
Originally written by
guy keren