Writing A Disk-Based File-System For The Linux Kernel V2.4

Why Yet Another File-System

Why not?
To get intimate knowledge with the VFS...
... so we can contribute to a "real" FS (ext2, reiserfs, nfs...)
Because we think we have an idea for an even better FS.
To expose something that's not an FS, as if it was one (e.g. ftpfs).
To be able to prepare an unusual lecture for our local LUG.
To leverage two month of unemployment for a future career...

General Parts Of A Disk-Based File-System's code

First, there is the on-disk file-system layout design...
... and the code to manipulate it...
... including a user-mode utility to create a new, empty FS...
... and a user-space utility to check and fix a broken FS.
Then there is code to handle the VFS's super-block.
And code to handle Inodes of directories..
... and inodes of files.
And code to handle file's data..
Finally, code to handle directories data (mostly for "readdir").

Copy copy, luki luki...

Before you write a new FS - read the code of an existing one...
"ext2" is a good idea, since it is the simplest "native" file-system.
Even thought FAT is much much simpler, it's not "native", and thus is too confusing to serve as a reference (has no inodes, so everything is emulated).
Next time, you'll have the code of STAMFS as a base reference, too ;)

The Disk-Layout Of STAMFS

Note: STAMFS is a silly, useless, unstable, crash-prone and very inefficient file-system, that assumes that each disk sector contains 512 bytes, and uses them in logical blocks of 1024 bytes each.
But it's simple enough to understand in 10 minutes and implement in 2 month of spare-time by someone unfamiliar with the VFS...
Here is how it looks:

The Inode Index

Contains a mapping from inode numbers to the logical disk block that contain their info.
Each entry on the disk occupies 4 bytes. Entry 0 contains the sector number of inode 1.
Inode numbers are between 1 (the root inode) and 1024/4 + 1 (=257).
Then why 4 bytes per entry and not less? for simplicity, and future extension.

The Disk-Layout Of A STAMFS Inode

The inode contains info such as last update/change/access time, data blocks size, index of the logical disk block with the data block index, etc.
The data block index contains a mapping between logical block numbers in the inode, and logical blocks on disk.
Entry 0 contains the number of the disk block whose contents is in block 0 of the file (bytes 0-1023).
Each entry in this index is 4 bytes long.
Thus, an inode (i.e. a file) on a STAMFS system is limited to 1024/4 (=256) sectors, or 128KB, of data

The Free Blocks List

The free blocks list contains indexes of disk blocks that were once allocated and then freed.
No need to hold all free blocks, since we have the number of the highest allocated block in the super-block.
Each entry is 4 bytes long, and thus we have 1024/4 (256) entries for free blocks.
When allocating a data block, we first scan the free list. Only if it is empty, we allocate the next highest non-allocated block.
If we free 257 blocks without allocating any.... BOOM!

The Directory Index

A directory is an inode of type "directory", whose data contains the list of files found in the directory.
Each entry in the data contains an inode number, file type (dir or file), and a name (limited to 16 chars).
We don't keep entries for "." and ".." on disk - the file system's code adds these entires "on-the-fly" at run-time.

Writing STAMFS - A Loop With 8 Iterations

When writing a file-system, we need to implement several layers of "objects".
To make life simpler, lets develop it using user-mode-linux...
... and using loopback mount ("file-system in a file").
Lets see how to do that in 8 iterations.
Important Note: the VFS is a very delicate creature. if you load half-written file system code, it might try to access not-implemented functions/structs and crash. The code in the steps below might cause some crashes for the system - do not panic.

Registering A New File-System

Use "register_filesystem()" to register a new file-system.

A "struct file_system_type" is the parameter:


     struct file_system_type stamfs_fstype = {
        "stamfs",          /* name of this file-system type.            */
        FS_REQUIRES_DEV,   /* this is a disk-based file-system.         */
        stamfs_read_super, /* function that reads a STAMFS super-block. */
        NULL               /* module owning this file-system.           */
     };

Use "unregister_filesystem()" during cleanup.

Here is the non-mountable STAMFS code.

Making It Mount-able

To make the file-system mountable, we implement the "stamfs_read_super()" function:

Read the super-block from the hard-disk.
Verify the STAMFS signature is there.
Initialize the "struct super_block" we got.
Initialize a "struct stamfs_super_meta" and attach as the super-block's FS private data.
Load the root inode (see next slide), and put in the dcache.

Loading An Inode From Disk

The "iget()" function we called to load the root inode will eventually invoke the read_inode super-block operation, stamfs_read_inode():

Find the disk block containing the inode, based on the inode number.
Read the inode's block.
Initialize the VFS's inode struct with the loaded data.
Attach STAMFS's inode-meta-data to the inode.
Set the various VFS operations (inode-operations, file-operations, address-space operations) based on the inode's type (dir, file...).

And "Cleanly" Un-mount-able

When a file-system is unmounted, its "put_super" super-block operation is invoked. stamfs_put_super() does:

Release the buffer_head-s we used for storing super-block related data (the super-block, the inode index block and the free list block).
Free the memory used for storing the STAMFS super-block's meta-data.

"Cleanly" Release Inodes

When we read the inode, we attached a meta-information structure to the inode object. When the system frees this object, we need to clear this memory. This is done by implementing the 'clear_inode' super-block operation:

Get the pointer to the meta-data structure.
Free it.
Set the original pointer to NULL - the inode object might be re-used later on...

Here is the just-mountable STAMFS code.

The Lookup Operation

Now we should allow searching for inodes inside directories, based on their names - a simple "ls" won't work without stamfs_iop_lookup:

Make sure the parameters are sane (e.g. file-name is not too long).
Scan the directory for an entry with the desired name.
If we found it - load the inode from disk, using the inode number (see next slide).
add the dentry as a child of the parent directory, in the dcache.

Reading The Contents Of A Directory

If we'll just stop here, we'll get a system crash in the first "ls" to the root directory - we want to be able to read the contents of the root directory, and this requires a file-operation named readdir, implemented in stamfs_readdir():

Make sure the file "read-head" position is not located past the last entry in the directory (for a large dir, we only return part of it, and the user will perform another readdir to read the next chunk, etc.).
If we're at the first position, insert info about directory '.' (we don't keep it on disk).
If we're at the second position, insert info about directory '..' (we don't keep it on disk).
Read the directory's data block from disk.
As long as there's space in the user's buffer, scan the list of files and add them to the user's buffer.
Update the last access-time of the directory.

Now, here is the search-able STAMFS code.
Note: expect a kernel-panic when bdflush tries to sync this directory - we'll get back to this later...

Support Directory Creation

When we instantiated the root-inode, we attached an inode-operations object - struct stamfs_dir_iops. It has a "mkdir" entry - to which we now attach stamfs_iop_mkdir():

Allocate a new inode for the child (see next slide).
Turn this inode into an empty-directory inode.
Increase the parent's link-count (the child's ".." points to the parent).
Increase the child's link-count (the child's "." points back at the child).
Add the child to the list of children of the parent.
Instantiate the child's dentry.

Creating A New Inode

To create a new inode, we have stamfs_inode_new_ino():

Allocate a block for the inode, and a block for the inode's block index.
Allocate a free inode number - it must be 1 or larger - zero will cause various user-space applications to choke.
Allocate a VFS inode struct.
Set the inode's link-count to 1 (it will be in a directory, which will point to it).
Set the rest of the inode struct's fields.
Attach STAMFS's meta-data to the inode's private data.
Set the various VFS operations (inode-operations, file-operations, address-space operations) based on the inode's type (dir, file...).
Add the inode to the parent directory.

Important - Writing The Super-Block

If we want to avoid a busy-loop, we must implement write-super:

Adding a new inode will dirty the super-block...
... so the VFS (via the update daemon) will try to write it to disk...
... in a loop of updating all dirty super-blocks.
The write-super function needs to mark the super-block as non-dirty, or else the update daemon will keep trying to write it - causing the system to get stuck.

So, here is the mkdir-able STAMFS code.

Support Directory Deletion

After we can create directories - we can try to delete them, too, with stamfs_iop_rmdir():

Make sure the directory is empty (it's up to US, not to the VFS. hence - we could have changed this semantics if we wanted to).
Unlink the directory (see next slide).
Decrease the parent's link count - the child's ".." no longer points to it.
Decrease the child's link count - the child's "." no longer points at itself.
Do not actually delete the directory's inode - the VFS will do it once it sees the directory's link-count dropped to 0.

Support Directory Deletion - The Unlink Operation

Unlink is implemented as a separate operation, since later on it will also be used to unlink files:

Remove the child from the list of children of its parent.
Decrease the child's link count - the parent no longer points to it.

Now, here is the rmdir-able STAMFS code.

Making Changes Persistent

Until now, when we unmounted our file-system, we lost all entires we created (well, not all - when we wrote into buffer-heads, they got written into disk - which is why we got some funny entries in directories, after un-mounting and re-mounting a file system on which we created directories).

We need to have the following:

Write the super-block to disk (already done).
Write new or updated inodes to disk.
Delete inodes from disk when their files (actually directories) are erased.
Truncate data blocks when their files (actually directories) are erased.

Writing Inodes Back To Disk

Writing inodes back to disk is done using the super-block's write_inode operation, and the stamfs_write_inode() function, which simply invokes stamfs_inode_write_ino():

Load the inode's block from disk.
Update it with data from the VFS inode struct.
Mark the inode's buffer_head as dirty - it'll be written by bdflush or during unmount.
Unless we were asked to do this synchronously - in which case we force an immediate buffer_head write.

Deleting Inodes From Disk

Deleting inodes from disk is done using the super-block's delete_inode operation, and the stamfs_delete_inode() function:

Sanity check - make sure this is not a "bad inode" (i.e. an invalid inode).
Set the inode's size field to zero, and if it contains any data blocks - invoke stamfs_inode_truncate() (see below).
Finally, free the inode, using stamfs_inode_free_ino() (which merely frees the inode number, the inodes block and its block index block).

Truncating An Inode's Data Blocks

This function is used both when deleting the inode, and when a user invokes a "truncate" operation (with a given offset) on a file, so it uses the inode's size field in order to know where to start truncating from:

Read the inode's block index from disk.
Calculate the entry of this index, from which we need to fully truncate blocks.
Scan the index from that point until its end, freeing each block (possibly adding it to the free list), and mark the entry as empty.
Update accounting data (e.g. number of blocks used by the inode).
Note: when we support files with data, we will need to get back to this function.

Now, here is the persistent STAMFS code.

Creating Normal Files

First, we need to add new operation structs for files:

In stamfs_iops, we add an empty stamfs_file_iops.
In stamfs_fops, we add an empty stamfs_file_fops.
Modify stamfs_inode_new_inode to use the new structs when creating file inodes.
Modify stamfs_inode_read_ino in the same manner.

The create Inode Operation

Creating a normal file is done using the "create" inode-operation, and in STAMFS, with stamfs_iop_create():
- Create a new inode for the file (using the previously encountered stamfs_inode_create_ino() function).
- Add it to children list the given directory, and instantiate the file's dentry.
- If the above failed, reduce the new inode's link count and iput() it - seeing that it has 0 links, the VFS will delete it (will invoke the super-block's "delete_inode" operation).
The unlink Inode Operation

As an added bonus, let us also support deleting files. This is real easy now:
- Simply add an 'unlink' operation to the stamfs_dir_iops struct, pointing to stamfs_iop_unlink. We already implemented that when adding support for rmdir.
Here is the file-create-able STAMFS code.

Supporting Files With Data

Supporting files with data is simpler then it looks, because we leave most of the work to the page cache, and the VFS's generic page-handling functions.
We simply need to map all address-space operations to functions that invoke the VFS generic page-handling functions, and implement a single function - stamfs_get_block():
- Translate the block offset parameter to the matching disk block number.
- If we succeeded, just return this block number.
- Otherwise - if we were told NOT to allocate a new block, return an error.
- Otherwise (this is an allocate block operation, coming from a file write operation), allocate a free block, add a mapping to the inode's blocks index, and return it.
Truncating Pages In The Page Cache

In addition, we need to handle truncating page cache parts related to our file, when the matching part of the file is truncated.
- In stamfs_inode_do_truncate, before the blocks freeing loop, we need to tell the page cache to delete the relevant pages, using the VFS's block_truncate_page function.
Syncing Files

Finally, we need to implement the fsync file operation, for both files and directories. The function is stamfs_sync_file:
- Take the inode from the given dentry parameter.
- Sync the buffers related to the inode (in our case, that's the inode's buffer-head and the block index buffer-head).
- Sync the data buffers of the file.
- If the inode itself is dirty, write the inode to disk in a synchronous manner.
Here is the files-with-data STAMFS code.

Implementing A Disk-Based File-System - Summary
- Use the generic VFS functions whenever possible.
- Implement things in small steps.
- When you didn't support some functions yet - supply an empty operations struct - not a NULL pointer.
- When updating a buffer_head or an inode - mark it as dirty!
- Whenever you change an inode - update the relevant times - no one will do that for you.
- Updating VFS objects is decoupled from updating disk objects.
Anyone Wants Some Homework?
- Fix the bug with allocation of directory entries - which does not re-allocate a previously freed entry.
- Allow creating larger files/directories, by having the last entry in an inode's block index, point to another block containing index data.
- Support a larger free blocks list in a similar manner (note: This is trickier - where do we store the blocks comprising the free blocks list?).
- Support having more inodes, in a similar manner.
- Add support for 'rename'.
- Add support for variable-length file names.
- Re-implement directories handling to use the page cache, rather then using buffer-heads directly.
References
1. Porting Filesystems from Kernel 2.4 to Kernel 2.6 - look in your Documentation/filesystems/porting directory of the kernel source tree.
2. The extended-2 filesystem overview (written by a Technion fellow ;) )
3. Virtual Filesystem: Building A Linux Filesystem From An Ordinary File - that's the way to go when developing a new file-system...
Originally written by guy keren