Why Yet Another File-System
- Why not?
- To get intimate knowledge with the VFS...
- ... so we can contribute to a "real" FS (ext2, reiserfs, nfs...)
- Because we think we have an idea for an even better FS.
- To expose something that's not an FS, as if it was one (e.g. ftpfs).
- To be able to prepare an unusual lecture for our local LUG.
- To leverage two month of unemployment for a future career...
General Parts Of A Disk-Based File-System's code
- First, there is the on-disk file-system layout design...
- ... and the code to manipulate it...
- ... including a user-mode utility to create a new, empty FS...
- ... and a user-space utility to check and fix a broken FS.
- Then there is code to handle the VFS's super-block.
- And code to handle Inodes of directories..
- ... and inodes of files.
- And code to handle file's data..
- Finally, code to handle directories data (mostly for "readdir").
Copy copy, luki luki...
- Before you write a new FS - read the code of an existing one...
- "ext2" is a good idea, since it is the simplest "native" file-system.
- Even thought FAT is much much simpler, it's not "native", and thus is
too confusing to serve as a reference (has no inodes, so everything is
emulated).
- Next time, you'll have the code of STAMFS as a base reference, too ;)
The Disk-Layout Of STAMFS
- Note: STAMFS is a silly, useless, unstable, crash-prone and
very inefficient file-system, that assumes that each disk sector contains
512 bytes, and uses them in logical blocks of 1024 bytes each.
- But it's simple enough to understand in 10 minutes and implement in 2
month of spare-time by someone unfamiliar with the VFS...
- Here is how it looks:
The Inode Index
- Contains a mapping from inode numbers to the logical disk block that
contain their info.
- Each entry on the disk occupies 4 bytes. Entry 0 contains the sector
number of inode 1.
- Inode numbers are between 1 (the root inode) and 1024/4 + 1 (=257).
- Then why 4 bytes per entry and not less? for simplicity, and future
extension.
The Disk-Layout Of A STAMFS Inode
- The inode contains info such as last update/change/access time, data
blocks size, index of the logical disk block with the data block index,
etc.
- The data block index contains a mapping between logical block numbers in
the inode, and logical blocks on disk.
- Entry 0 contains the number of the disk block whose contents is in block 0
of the file (bytes 0-1023).
- Each entry in this index is 4 bytes long.
- Thus, an inode (i.e. a file) on a STAMFS system is limited to 1024/4
(=256) sectors, or 128KB, of data
The Free Blocks List
- The free blocks list contains indexes of disk blocks that were once
allocated and then freed.
- No need to hold all free blocks, since we have the number of the highest
allocated block in the super-block.
- Each entry is 4 bytes long, and thus we have 1024/4 (256) entries for
free blocks.
- When allocating a data block, we first scan the free list. Only if it
is empty, we allocate the next highest non-allocated block.
- If we free 257 blocks without allocating any.... BOOM!
The Directory Index
- A directory is an inode of type "directory", whose data contains the list
of files found in the directory.
- Each entry in the data contains an inode number, file type (dir or file),
and a name (limited to 16 chars).
- We don't keep entries for "." and ".." on disk - the file system's code
adds these entires "on-the-fly" at run-time.
Writing STAMFS - A Loop With 8 Iterations
- When writing a file-system, we need to implement several layers of
"objects".
- To make life simpler, lets develop it using user-mode-linux...
- ... and using loopback mount ("file-system in a file").
- Lets see how to do that in 8 iterations.
- Important Note: the VFS is a very delicate creature. if you load
half-written file system code, it might try to access not-implemented
functions/structs and crash. The code in the steps below might cause
some crashes for the system - do not panic.
Registering A New File-System
- Use "register_filesystem()" to register a new file-system.
- A "struct file_system_type" is the parameter:
struct file_system_type stamfs_fstype = {
"stamfs", /* name of this file-system type. */
FS_REQUIRES_DEV, /* this is a disk-based file-system. */
stamfs_read_super, /* function that reads a STAMFS super-block. */
NULL /* module owning this file-system. */
};
- Use "unregister_filesystem()" during cleanup.
Here is the non-mountable STAMFS code.
Making It Mount-able
To make the file-system mountable, we implement the "stamfs_read_super()"
function:
- Read the super-block from the hard-disk.
- Verify the STAMFS signature is there.
- Initialize the "struct super_block" we got.
- Initialize a "struct stamfs_super_meta" and attach as the super-block's
FS private data.
- Load the root inode (see next slide), and put in the dcache.
Loading An Inode From Disk
The "iget()" function we called to load the root inode will eventually invoke
the read_inode super-block operation, stamfs_read_inode():
- Find the disk block containing the inode, based on the inode number.
- Read the inode's block.
- Initialize the VFS's inode struct with the loaded data.
- Attach STAMFS's inode-meta-data to the inode.
- Set the various VFS operations (inode-operations, file-operations,
address-space operations) based on the inode's type (dir, file...).
And "Cleanly" Un-mount-able
When a file-system is unmounted, its "put_super" super-block operation is
invoked. stamfs_put_super() does:
- Release the buffer_head-s we used for storing super-block related
data (the super-block, the inode index block and the free list block).
- Free the memory used for storing the STAMFS super-block's meta-data.
"Cleanly" Release Inodes
When we read the inode, we attached a meta-information structure to the inode
object. When the system frees this object, we need to clear this memory. This
is done by implementing the 'clear_inode' super-block operation:
- Get the pointer to the meta-data structure.
- Free it.
- Set the original pointer to NULL - the inode object might be re-used
later on...
Here is the just-mountable STAMFS code.
The Lookup Operation
Now we should allow searching for inodes inside directories, based on their
names - a simple "ls" won't work without stamfs_iop_lookup:
- Make sure the parameters are sane (e.g. file-name is not too long).
- Scan the directory for an entry with the desired name.
- If we found it - load the inode from disk, using the inode number
(see next slide).
- add the dentry as a child of the parent directory, in the dcache.
Reading The Contents Of A Directory
If we'll just stop here, we'll get a system crash in the first "ls" to the
root directory - we want to be able to read the contents of the root directory,
and this requires a file-operation named readdir, implemented in
stamfs_readdir():
- Make sure the file "read-head" position is not located past the last
entry in the directory (for a large dir, we only return part of it,
and the user will perform another readdir to read the next chunk, etc.).
- If we're at the first position, insert info about directory '.' (we don't
keep it on disk).
- If we're at the second position, insert info about directory '..' (we don't
keep it on disk).
- Read the directory's data block from disk.
- As long as there's space in the user's buffer, scan the list of files
and add them to the user's buffer.
- Update the last access-time of the directory.
Now, here is the search-able STAMFS code.
Note: expect a kernel-panic when bdflush tries to sync this
directory - we'll get back to this later...
Support Directory Creation
When we instantiated the root-inode, we attached an inode-operations object -
struct stamfs_dir_iops. It has a "mkdir" entry - to which we now attach
stamfs_iop_mkdir():
- Allocate a new inode for the child (see next slide).
- Turn this inode into an empty-directory inode.
- Increase the parent's link-count (the child's ".." points to the parent).
- Increase the child's link-count (the child's "." points back at the child).
- Add the child to the list of children of the parent.
- Instantiate the child's dentry.
Creating A New Inode
To create a new inode, we have stamfs_inode_new_ino():
- Allocate a block for the inode, and a block for the inode's block index.
- Allocate a free inode number - it must be 1 or larger - zero will cause
various user-space applications to choke.
- Allocate a VFS inode struct.
- Set the inode's link-count to 1 (it will be in a directory, which will
point to it).
- Set the rest of the inode struct's fields.
- Attach STAMFS's meta-data to the inode's private data.
- Set the various VFS operations (inode-operations, file-operations,
address-space operations) based on the inode's type (dir, file...).
- Add the inode to the parent directory.
Important - Writing The Super-Block
If we want to avoid a busy-loop, we must implement write-super:
- Adding a new inode will dirty the super-block...
- ... so the VFS (via the update daemon) will try to write it to disk...
- ... in a loop of updating all dirty super-blocks.
- The write-super function needs to mark the super-block as non-dirty, or
else the update daemon will keep trying to write it - causing the system
to get stuck.
So, here is the mkdir-able STAMFS code.
Support Directory Deletion
After we can create directories - we can try to delete them, too, with
stamfs_iop_rmdir():
- Make sure the directory is empty (it's up to US, not to the VFS. hence -
we could have changed this semantics if we wanted to).
- Unlink the directory (see next slide).
- Decrease the parent's link count - the child's ".." no longer points to it.
- Decrease the child's link count - the child's "." no longer points at
itself.
- Do not actually delete the directory's inode - the VFS will do it once
it sees the directory's link-count dropped to 0.
Support Directory Deletion - The Unlink Operation
Unlink is implemented as a separate operation, since later on it will also be
used to unlink files:
- Remove the child from the list of children of its parent.
- Decrease the child's link count - the parent no longer points to it.
Now, here is the rmdir-able STAMFS code.
Making Changes Persistent
Until now, when we unmounted our file-system, we lost all entires we created
(well, not all - when we wrote into buffer-heads, they got written into
disk - which is why we got some funny entries in directories, after un-mounting
and re-mounting a file system on which we created directories).
We need to have the following:
- Write the super-block to disk (already done).
- Write new or updated inodes to disk.
- Delete inodes from disk when their files (actually directories) are erased.
- Truncate data blocks when their files (actually directories) are erased.
Writing Inodes Back To Disk
Writing inodes back to disk is done using the super-block's write_inode
operation, and the stamfs_write_inode() function, which simply invokes
stamfs_inode_write_ino():
- Load the inode's block from disk.
- Update it with data from the VFS inode struct.
- Mark the inode's buffer_head as dirty - it'll be written by bdflush or
during unmount.
- Unless we were asked to do this synchronously - in which case we force
an immediate buffer_head write.
Deleting Inodes From Disk
Deleting inodes from disk is done using the super-block's delete_inode
operation, and the stamfs_delete_inode() function:
- Sanity check - make sure this is not a "bad inode" (i.e. an invalid inode).
- Set the inode's size field to zero, and if it contains any data blocks -
invoke stamfs_inode_truncate() (see below).
- Finally, free the inode, using stamfs_inode_free_ino() (which merely
frees the inode number, the inodes block and its block index block).
Truncating An Inode's Data Blocks
This function is used both when deleting the inode, and when a user invokes
a "truncate" operation (with a given offset) on a file, so it uses the
inode's size field in order to know where to start truncating from:
- Read the inode's block index from disk.
- Calculate the entry of this index, from which we need to fully truncate
blocks.
- Scan the index from that point until its end, freeing each block
(possibly adding it to the free list), and mark the entry as empty.
- Update accounting data (e.g. number of blocks used by the inode).
- Note: when we support files with data, we will need to get back to
this function.
Now, here is the persistent STAMFS code.
Creating Normal Files
First, we need to add new operation structs for files:
- In stamfs_iops, we add an empty stamfs_file_iops.
- In stamfs_fops, we add an empty stamfs_file_fops.
- Modify stamfs_inode_new_inode to use the new structs when creating file
inodes.
- Modify stamfs_inode_read_ino in the same manner.
The create Inode Operation
Creating a normal file is done using the "create" inode-operation, and in
STAMFS, with stamfs_iop_create():
- Create a new inode for the file (using the previously encountered
stamfs_inode_create_ino() function).
- Add it to children list the given directory, and instantiate the file's
dentry.
- If the above failed, reduce the new inode's link count and iput() it -
seeing that it has 0 links, the VFS will delete it (will invoke the
super-block's "delete_inode" operation).
The unlink Inode Operation
As an added bonus, let us also support deleting files. This is real easy now:
- Simply add an 'unlink' operation to the stamfs_dir_iops struct, pointing
to stamfs_iop_unlink. We already implemented that when adding support
for rmdir.
Here is the file-create-able STAMFS code.
Supporting Files With Data
Supporting files with data is simpler then it looks, because we leave most
of the work to the page cache, and the VFS's generic page-handling
functions.
We simply need to map all address-space operations to functions that invoke
the VFS generic page-handling functions, and implement a single function -
stamfs_get_block():
- Translate the block offset parameter to the matching disk block number.
- If we succeeded, just return this block number.
- Otherwise - if we were told NOT to allocate a new block, return an error.
- Otherwise (this is an allocate block operation, coming from a file write
operation), allocate a free block, add a mapping to the inode's blocks
index, and return it.
Truncating Pages In The Page Cache
In addition, we need to handle truncating page cache parts related to our
file, when the matching part of the file is truncated.
- In stamfs_inode_do_truncate, before the blocks freeing loop, we need
to tell the page cache to delete the relevant pages, using the VFS's
block_truncate_page function.
Syncing Files
Finally, we need to implement the fsync file operation, for both files
and directories. The function is stamfs_sync_file:
- Take the inode from the given dentry parameter.
- Sync the buffers related to the inode (in our case, that's the inode's
buffer-head and the block index buffer-head).
- Sync the data buffers of the file.
- If the inode itself is dirty, write the inode to disk in a synchronous
manner.
Here is the files-with-data STAMFS code.
Implementing A Disk-Based File-System - Summary
- Use the generic VFS functions whenever possible.
- Implement things in small steps.
- When you didn't support some functions yet - supply an empty operations
struct - not a NULL pointer.
- When updating a buffer_head or an inode - mark it as dirty!
- Whenever you change an inode - update the relevant times - no one
will do that for you.
- Updating VFS objects is decoupled from updating disk objects.
Anyone Wants Some Homework?
- Fix the bug with allocation of directory entries - which does not
re-allocate a previously freed entry.
- Allow creating larger files/directories, by having the last entry in an
inode's block index, point to another block containing index data.
- Support a larger free blocks list in a similar manner (note: This is
trickier - where do we store the blocks comprising the free blocks list?).
- Support having more inodes, in a similar manner.
- Add support for 'rename'.
- Add support for variable-length file names.
- Re-implement directories handling to use the page cache, rather then using
buffer-heads directly.
References
- Porting Filesystems from Kernel 2.4 to Kernel 2.6 - look in your Documentation/filesystems/porting directory of the kernel source tree.
- The extended-2 filesystem overview (written by a Technion fellow ;) )
- Virtual Filesystem: Building A Linux Filesystem From An Ordinary File - that's the way to go when developing a new file-system...
Originally written by
guy keren