% Linux Kernel Module Programming Guide -*- TeX -*-
% Copyright (C) 1998 by Ori Pomerantz
%
% This file is freely redistributable, but you must preserve this copyright
% notice on all copies, and it must be distributed only as part of "Linux
% Kernel Module Programming Guide". This file's use is covered by the
% copyright for the entire document, in the file "copyright.tex".
%

% m4 Macro for a source file (I use m4 so I can include a file within
% verbatim mode).
define(`sourcesample', `
\vskip 2 ex
\addcontentsline{toc}{section}{$1}
{\large\bf $1}
\index{$1, source file}\index{source\\$1}

\begin{verbatim}
include(../source/$2/$1)
\end{verbatim}
')

% Document specific definitions
\newcommand{\myversion}{1.0}
\newcommand{\myyear}{1998}
\newcommand{\mydate}{23 November \myyear}
\newcommand{\bookname}{Linux Kernel Module Programming Guide}

% Author dependent definitions
\newcommand{\myemail}{ori@hishome.net}
\newcommand{\myname}{Ori Pomerantz}
\newcommand{\myaddress}{\myname\\
Apt. \#1032\\
2355 N Hwy 360\\
Grand Prairie\\
TX 75050\\
USA}

\typeout{ * \bookname, \myemail}
\typeout{ * Version \myversion, \mydate.}

% Conditional flags. Set these based on how you are formatting the book.
% For Slackware edition:
\def\igsslack{1}
% For plain ASCII edition:
%\def\igsascii{0}

% The style of the document
\documentstyle[times,indentfirst,epsfig,twoside,linuxdoc,lotex]{report}

% We WANT an index.
\makeindex

% Set title information.
\title{\bookname}
\years{\myyear}
% \author{\large \myname}
\abstract{Version \myversion, \mydate. \\ \vskip 1ex
This book is about writing Linux Kernel Modules. It is, hopefully, useful for programmers who know C and want to learn how to write kernel modules. It is written as a ``How-To'' instruction manual, with examples of all of the important techniques.
\\ \vskip 1ex
Although this book touches on many points of kernel design, it is not supposed to fulfill that need --- there are other books on this subject, both in print and in the Linux documentation project.
\\ \vskip 1ex
You may freely copy and redistribute this book under certain conditions. Please see the copyright and distribution statement.}

% this is a 'special' for dvips
\special{papersize=7in,9in}
\setlength\paperwidth {7in}
\setlength\paperheight {9in}

% Table of content
\setcounter{secnumdepth}{5}
\setcounter{tocdepth}{2}

% Initially, roman numbering with no numbers.
\pagenumbering{roman}
\pagestyle{empty}
\sloppy

%%
%% end of preamble
%%

\begin{document}
% \raggedbottom
\setlength{\parskip}{0pt} %remove space between paragraphs
\maketitle
include(copyright.m4)
\setcounter{page}{0}
\pagestyle{headings}
\tableofcontents

% I like my introductions to be chapter zero.
\setcounter{chapter}{-1}

% No more Roman Numerals for me!
\pagenumbering{arabic}
\pagestyle{empty}

% At last, the REAL beginning
\chapter{Introduction}\label{introduction}

So, you want to write a kernel module. You know C, you've written a number of normal programs to run as processes, and now you want to get to where the real action is, to where a single wild pointer can wipe out your file system and a core dump means a reboot. Well, welcome to the club. I once had a wild pointer wipe an important directory under DOS (thankfully, now it stands for the {\bf D}ead {\bf O}perating {\bf S}ystem), and I don't see why living under Linux should be any safer.
\index{DOS}

{\bf Warning:} I wrote this and checked the program under version 2.0.35 of the kernel running on a Pentium. For the most part, it should work on other CPUs and on other versions of the kernel, as long as they are 2.0.x or above, but I can't promise anything. One exception is chapter \ref{int-handler}, which should not work on any other architecture.
\section{Who Should Read This}\label{who-should-read}

This document is for people who want to write kernel modules. Although I will touch on how things are done in the kernel in several places, that is not my purpose. There are enough good sources which do a better job than I could have done. The kernel is a great piece of programming, and I believe that programmers should read at least some kernel source files and understand them. Having said that, I also believe in the value of playing with the system first and asking questions later. When I learn a new programming language, I don't start by reading the library code, but by writing a small ``hello, world'' program. I don't see why playing with the kernel should be any different.

\section{Note on the Style}\label{style-note}

I like to put as many jokes as possible into my documentation. I'm writing this because I enjoy it, and I assume most of you are reading this for the same reason. If you just want to get to the point, ignore all the normal text and read the source code. I promise to put all the important details in remarks.

\section{Acknowledgements}\label{acknowledgments}

I'd like to thank Yoav Weiss for many helpful ideas and discussions, as well as for finding mistakes within this document before its publication. Of course, any remaining mistakes are purely my fault. The \TeX \ skeleton for this book was shamelessly stolen from the ``Linux Installation and Getting Started'' guide, where the \TeX \ work was done by Matt Welsh. My gratitude to Richard Stallman, Linus Torvalds and all the other people who made it possible for me to run a high quality operating system on my computer and get the source code goes without saying (yeah, right --- then why did I say it?).

\chapter{Hello, world}\label{hello-world}

When the first caveman programmer chiseled the first program on the walls of the first cave computer, it was a program to paint the string ``Hello, world'' in antelope pictures.
Roman programming textbooks began with the ``Salut, Mundi'' program. I don't know what happens to people who break with this tradition, and I think it's safer not to find out.
\index{hello world} \index{salut mundi}

A kernel module has to have at least two functions: {\tt init\_module}, which is called when the module is inserted into the kernel, and {\tt cleanup\_module}, which is called just before it is removed. Typically, {\tt init\_module} either registers a handler for something with the kernel, or it replaces one of the kernel functions with its own code (usually code to do something and then call the original function). The {\tt cleanup\_module} function is supposed to undo whatever {\tt init\_module} did, so the module can be unloaded safely.
\index{init\_module} \index{cleanup\_module}

sourcesample(hello.c, 01_hello)

sourcesample(Makefile, 01_hello)

So, now the only thing left is to {\tt su} to root (you didn't compile this as root, did you? Living on the edge\dots), and then {\tt insmod hello} and {\tt rmmod hello} to your heart's content. While you do it, notice your new kernel module in {\tt /proc/modules}.
\index{insmod} \index{rmmod} \index{/proc/modules} \index{root}

By the way, the reason the Makefile recommends against doing {\tt insmod} from X is that when the kernel has a message to print with {\tt printk}, it sends it to the console. When you don't use X, it just goes to the virtual terminal you're using (the one you chose with Alt-F$<$n$>$) and you see it. When you do use X, on the other hand, there are two possibilities. Either you have a console open with {\tt xterm -C}, in which case the output will be sent there, or you don't, in which case the output will go to virtual terminal 7 --- the one ``covered'' by X.
\index{X\\why you should avoid} \index{xterm -C} \index{console} \index{virtual terminal} \index{terminal\\virtual} \index{printk}

If your kernel becomes unstable you're more likely to get the debug messages without X.
Outside of X, {\tt printk}'s go directly from the kernel to the console. In X, on the other hand, {\tt printk}'s go to a user mode process ({\tt xterm -C}). When that process receives CPU time, it is supposed to send it to the X server process. Then, when the X server receives the CPU, it is supposed to display it --- but an unstable kernel usually means that the system is about to crash or reboot, so you don't want to delay the error messages, which might explain to you what went wrong, for longer than you have to. \chapter{Character Device Files}\label{char-dev-file} \index{character device files} \index{device files\\character} So, now we're bold kernel programmers and we know how to write kernel modules to do nothing. We feel proud of ourselves and we hold our heads up high. But somehow we get the feeling that something is missing. Catatonic modules are not much fun. There are two major ways for a kernel module to talk to processes. One is through device files (like the files in the {\tt /dev} directory), the other is to use the proc file system. Since one of the major reasons to write something in the kernel is to support some kind of hardware device, we'll begin with device files. \index{/dev} The original purpose of device files is to allow processes to communicate with device drivers in the kernel, and through them with physical devices (modems, terminals, etc.). The way this is implemented is the following. \index{devices\\physical} \index{physical devices} \index{modem} \index{terminal} Each device driver, which is responsible for some type of hardware, is assigned its own major number. The list of drivers and their major numbers is available in {\tt /proc/devices}. Each physical device managed by a device driver is assigned a minor number. The {\tt /dev} directory is supposed to include a special file, called a device file, for each of those devices, whether or not it's really installed on the system. 
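
The lookup in {\tt /proc/devices} described above can also be done from a program. What follows is only a sketch, written as an ordinary user process rather than as kernel code. The name {\tt lookup\_char\_major} is invented for this illustration, and the code assumes the usual layout of the file: a {\tt Character devices:} heading followed by one major number and driver name per line.

\begin{verbatim}
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up the major number of a character device driver by name in a
 * listing laid out like /proc/devices. Returns the major number, or -1
 * if the driver is not listed or the file cannot be read.
 * lookup_char_major is a hypothetical helper, not a kernel or libc call. */
int lookup_char_major(const char *listing_path, const char *driver)
{
    char line[256];
    char name[64];
    int num;
    int in_char_section = 0;
    int found = -1;
    FILE *fp;

    assert(listing_path != NULL && driver != NULL);
    fp = fopen(listing_path, "r");
    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, "Character devices:", 18) == 0) {
            in_char_section = 1;    /* entries that follow are char drivers */
        } else if (strncmp(line, "Block devices:", 14) == 0) {
            in_char_section = 0;    /* entries that follow are block drivers */
        } else if (in_char_section &&
                   sscanf(line, "%d %63s", &num, name) == 2 &&
                   strcmp(name, driver) == 0) {
            found = num;
            break;
        }
    }
    fclose(fp);
    return found;
}
\end{verbatim}

On a typical system, calling it with {\tt /proc/devices} and the driver name {\tt mem} should give 1, the major number behind {\tt /dev/null}.
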
\index{major number} \index{number\\major (of device driver)} \index{minor number} \index{number\\minor (of physical device)}

For example, if you do {\tt ls -l /dev/hd[ab]*}, you'll see all of the IDE hard disk partitions which might be connected to a machine. Notice that all of them have the same major number, 3, but the minor number changes from one to the other {\em $<$Disclaimer: This assumes you're using a PC architecture. I don't know about devices on Linux running on other architectures$>$}.
\index{IDE\\hard disk} \index{partition\\of hard disk} \index{hard disk\\partitions of}

When the system was installed, all of those device files were created by the {\tt mknod} command. There's no technical reason why they have to be in the {\tt /dev} directory; it's just a useful convention. When creating a device file for testing purposes, as with the exercise here, it would probably make more sense to place it in the directory where you compile the kernel module.
\index{mknod} \index{/dev}

Devices are divided into two types: character devices, which only support sequential access, and block devices, which support seek operations\footnote{A seek operation tells a physical device to go to information at a particular location.}. Most devices in the world are character devices, because they can only give their output or receive their input in sequence. The main use of block devices is for storage devices, which support random access (imagine if you had to read everything on your 2 GB hard-disk just to get to the file located in the last sector...). You can tell whether a device file is for a block device or a character device by looking at the first character in the output of {\tt ls -l}. If it's ``b'' then it's a block device, and if it's ``c'' then it's a character device.
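
The distinction that {\tt ls -l} shows can also be read with {\tt stat}. The sketch below is a user process, not kernel code. {\tt describe\_device} is a hypothetical helper, and the code assumes a C library which provides the {\tt major} and {\tt minor} macros in {\tt sys/sysmacros.h}, as glibc does.

\begin{verbatim}
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Classify a path the way the first column of ls -l does and report its
 * device numbers. Returns 'c' or 'b' for device files, '-' for anything
 * else and '?' if stat() fails. The two output parameters are only
 * written for device files. describe_device is a hypothetical helper
 * written for this illustration. */
char describe_device(const char *path, unsigned *major_out,
                     unsigned *minor_out)
{
    struct stat st;

    assert(path != NULL && major_out != NULL && minor_out != NULL);
    if (stat(path, &st) != 0)
        return '?';
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return '-';

    /* For a device file, st_rdev holds the numbers of the device itself. */
    *major_out = major(st.st_rdev);
    *minor_out = minor(st.st_rdev);
    return S_ISCHR(st.st_mode) ? 'c' : 'b';
}
\end{verbatim}

For {\tt /dev/null} this should report a character device with major number 1 and minor number 3 on a standard Linux system.
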
\index{device files\\character} \index{device files\\block} \index{seek operation} \index{operation\\seek} \index{sequential access} \index{access\\sequential} This module is divided into two separate parts: The module part which registers the device and the device driver part. The {\tt init\_module} function calls {\tt module\_register\_chrdev} to add the device driver to the kernel's character device driver table. It also returns the major number to be used for the driver. The {\tt cleanup\_module} function deregisters the device. \index{module\_register\_chrdev} \index{major device number} \index{device number\\major} This (registering something and unregistering it) is the general functionality of those two functions. Things in the kernel don't run on their own initiative, like processes, but are called, by processes via system calls, or by hardware devices via interrupts, or by other parts of the kernel (simply by calling specific functions). As a result, when you add code to the kernel, you're supposed to register it as the handler for a certain type of event and when you remove it, you're supposed to unregister it. \index{init\_module\\general purpose} \index{cleanup\_module\\general purpose} The device driver proper is composed of the four device\_$<$action$>$ functions, which are called when somebody tries to do something with a device file which has our major number. The way the kernel knows to call them is via the {\tt file\_operations} structure, {\tt Fops}, which was given when the device was registered, which includes pointers to those four functions. \index{file\_operations structure} \index{struct file\_operations} Another point we need to remember here is that we can't allow the kernel module to be {\tt rmmod}ed whenever root feels like it. The reason is that if the device file is opened by a process and then we remove the kernel module, using the file would cause a call to the memory location where the appropriate function (read/write) used to be. 
If we're lucky, no other code was loaded there, and we'll get an ugly error message. If we're unlucky, another kernel module was loaded into the same location, which means a jump into the middle of another function within the kernel. The results of this would be impossible to predict, but they can't be positive. \index{rmmod\\preventing} Normally, when you don't want to allow something, you return an error code (a negative number) from the function which is supposed to do it. With {\tt cleanup\_module} that is impossible because it's a void function. Once {\tt cleanup\_module} is called, the module is dead. However, there is a use counter which counts how many other kernel modules are using this kernel module, called the reference count (that's the last number of the line in {\tt /proc/modules}). If this number isn't zero, {\tt rmmod} will fail. The module's reference count is available in the variable {\tt mod\_use\_count\_}. Since there are macros defined for handling this variable ({\tt MOD\_INC\_USE\_COUNT} and {\tt MOD\_DEC\_USE\_COUNT}), we prefer to use them, rather than {\tt mod\_use\_count\_} directly, so we'll be safe if the implementation changes in the future. \index{/proc/modules} \index{reference count} \index{mod\_use\_count\_} \index{cleanup\_module} \index{MOD\_INC\_USE\_COUNT} \index{MOD\_DEC\_USE\_COUNT} sourcesample(chardev.c, 02_chardev) \chapter{The /proc File System}\label{proc-fs} \index{proc file system} \index{/proc file system} \index{file system\\/proc} In Linux there is an additional mechanism for the kernel, and kernel modules, to send information to processes --- the {\tt /proc} file system. Originally designed to allow easy access to information about processes (hence the name), it is now used by every bit of the kernel which has something interesting to report, such as {\tt /proc/modules} which has the list of modules and {\tt /proc/meminfo} which has memory usage statistics. 
\index{/proc/modules} \index{/proc/meminfo}

The way to use the proc file system is very similar to the one used with device drivers --- you create a structure with all the information needed for the {\tt /proc} file, including pointers to any handler functions (in our case there is only one, the one called when somebody attempts to read from the {\tt /proc} file). Then, {\tt init\_module} registers the structure with the kernel and {\tt cleanup\_module} unregisters it. We use {\tt proc\_register\_dynamic} because we don't want to fix the inode number used for our file in advance; instead we let the kernel choose it, which prevents clashes. Normal file systems are located on a disk, rather than just in memory (which is where {\tt /proc} is), and in that case the inode number is a pointer to a disk location where the file's index-node (inode for short) is located. The inode contains information about the file, for example the file's permissions, together with a pointer to the disk location or locations where the file's data can be found.
\index{proc\_register\_dynamic} \index{inode}

Because we don't get called when the file is opened or closed, there's nowhere for us to put {\tt MOD\_INC\_USE\_COUNT} and {\tt MOD\_DEC\_USE\_COUNT} in this module, and if the file is opened and then the module is removed, there's no way to avoid the consequences. In the next chapter we'll see a way of dealing with {\tt /proc} files which is harder to implement but more flexible, and which will allow us to protect against this problem as well.

sourcesample(procfs.c, 03_procfs)

\chapter{Using /proc For Input}\label{proc-input}
\index{Input\\using /proc for} \index{/proc\\using for input} \index{proc\\using for input}

So far we have two ways to generate output from kernel modules: we can register a device driver and {\tt mknod} a device file, or we can create a {\tt /proc} file. This allows the kernel module to tell us anything it likes.
The only problem is that there is no way for us to talk back. The first way we'll send input to kernel modules will be by writing back to the {\tt /proc} file. Because the proc filesystem was written mainly to allow the kernel to report its situation to processes, there are no special provisions for input. The {\tt proc\_dir\_entry} struct doesn't include a pointer to an input function, the way it includes a pointer to an output function. Instead, to write into a {\tt /proc} file, we need to use the standard filesystem mechanism. \index{proc\_dir\_entry structure} \index{struct proc\_dir\_entry} In Linux there is a standard mechanism for file system registration. Since every file system has to have its own functions to handle inode and file operations\footnote{The difference between the two is that file operations deal with the file itself, and inode operations deal with ways of referencing the file, such as creating links to it.}, there is a special structure to hold pointers to all those functions, {\tt struct inode\_operations}, which includes a pointer to {\tt struct file\_operations}. In /proc, whenever we register a new file, we're allowed to specify which {\tt struct inode\_operations} will be used for access to it. This is the mechanism we use, a {\tt struct inode\_operations} which includes a pointer to a {\tt struct file\_operations} which includes pointers to our {\tt module\_input} and {\tt module\_output} functions. \index{file system registration} \index{registration\\file system} \index{struct inode\_operations} \index{inode\_operations structure} \index{struct file\_operations} \index{file\_operations structure} It's important to note that the standard roles of read and write are reversed in the kernel. Read functions are used for output, whereas write functions are used for input. 
The reason for that is that read and write refer to the user's point of view --- if a process reads something from the kernel, then the kernel needs to output it, and if a process writes something to the kernel, then the kernel receives it as input.
\index{read\\in the kernel} \index{write\\in the kernel}

Another interesting point here is the {\tt module\_permission} function. This function is called whenever a process tries to do something with the {\tt /proc} file, and it can decide whether to allow access or not. Right now it is only based on the operation and the uid of the current user (as available in {\tt current}, a pointer to a structure which includes information on the currently running process), but it could be based on anything we like, such as what other processes are doing with the same file, the time of day, or the last input we received.
\index{module\_permission} \index{permissions} \index{current pointer} \index{pointer\\current}

The reason for {\tt put\_user} and {\tt get\_user} is that Linux memory (under Intel architecture, it may be different under some other processors) is segmented. This means that a pointer, by itself, does not reference a unique location in memory, only a location in a memory segment, and you need to know which memory segment it is to be able to use it. There is one memory segment for the kernel, and one for each of the processes.
\index{put\_user} \index{get\_user} \index{memory segments} \index{segment\\memory}

The only memory segment accessible to a process is its own, so when writing regular programs to run as processes, there's no need to worry about segments. When you write a kernel module, normally you want to access the kernel memory segment, which is handled automatically by the system. However, when the content of a memory buffer needs to be passed between the currently running process and the kernel, the kernel function receives a pointer to the memory buffer which is in the process segment.
The {\tt put\_user} and {\tt get\_user} macros allow you to access that memory. sourcesample(procfs.c, 04_procfs2) \chapter{Talking to Device Files (writes and IOCTLs)}\label{dev-input} \index{device files\\input to} \index{input to device files} \index{ioctl} \index{write\\to device files} Device files are supposed to represent physical devices. Most physical devices are used for output as well as input, so there has to be some mechanism for device drivers in the kernel to get the output to send to the device from processes. This is done by opening the device file for output and writing to it, just like writing to a file. In the following example, this is implemented by {\tt device\_write}. This is not always enough. Imagine you had a serial port connected to a modem (even if you have an internal modem, it is still implemented from the CPU's perspective as a serial port connected to a modem, so you don't have to tax your imagination too hard). The natural thing to do would be to use the device file to write things to the modem (either modem commands or data to be sent through the phone line) and read things from the modem (either responses for commands or the data received through the phone line). However, this leaves open the question of what to do when you need to talk to the serial port itself, for example to send the rate at which data is sent and received. \index{serial port} \index{modem} The answer in Unix is to use a special function called {\tt ioctl} (short for {\bf i}nput {\bf o}utput {\bf c}on{\bf t}ro{\bf l}). Every device can have its own {\tt ioctl} commands, which can be read {\tt ioctl}'s (to send information from a process to the kernel), write {\tt ioctl}'s (to return information to a process), \footnote{Notice that here the roles of read and write are reversed {\em again}, so in {\tt ioctl}'s read is to send information to the kernel and write is to receive information from the kernel.} both or neither. 
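
The four kinds of command listed above (read, write, both or neither) are recorded in the {\tt ioctl} number itself, and the kernel headers offer macros to build such numbers and to take them apart again. The following sketch is a user process. The magic number and the {\tt DEMO\_} names are invented for this illustration, and it assumes the kernel headers ({\tt linux/ioctl.h}) are installed. It only labels the direction bits and says nothing about which way a real driver moves the data.

\begin{verbatim}
#include <assert.h>
#include <linux/ioctl.h>

/* Hypothetical commands for an imaginary driver. The magic number 'k'
 * and the command names are invented for this illustration. */
#define DEMO_MAGIC     'k'
#define DEMO_CMD_NONE  _IO(DEMO_MAGIC, 0)         /* no parameter      */
#define DEMO_CMD_R     _IOR(DEMO_MAGIC, 1, int)   /* built with _IOR   */
#define DEMO_CMD_W     _IOW(DEMO_MAGIC, 2, int)   /* built with _IOW   */
#define DEMO_CMD_RW    _IOWR(DEMO_MAGIC, 3, int)  /* built with _IOWR  */

/* Return a short label for the direction bits packed into an ioctl
 * number: "none", "read", "write" or "both". */
const char *ioctl_direction(unsigned long cmd)
{
    unsigned dir = _IOC_DIR(cmd);

    assert((dir & ~(unsigned)(_IOC_NONE | _IOC_READ | _IOC_WRITE)) == 0);
    if (dir == (_IOC_READ | _IOC_WRITE))
        return "both";
    if (dir == _IOC_READ)
        return "read";
    if (dir == _IOC_WRITE)
        return "write";
    return "none";
}
\end{verbatim}

{\tt \_IOC\_TYPE}, {\tt \_IOC\_NR} and {\tt \_IOC\_SIZE} recover the other fields in the same way, so {\tt \_IOC\_NR(DEMO\_CMD\_W)} is 2 and {\tt \_IOC\_SIZE(DEMO\_CMD\_RW)} is the size of an {\tt int}.
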
The ioctl function is called with three parameters: the file descriptor of the appropriate device file, the ioctl number, and a parameter, which is of type long so you can use a cast to pass anything through it.\footnote{This isn't exact. You won't be able to pass a structure, for example, through an ioctl --- but you will be able to pass a pointer to the structure.} The ioctl number encodes the major device number, the type of the ioctl, the command, and the type of the parameter. This ioctl number is usually created by a macro call ({\tt \_IO}, {\tt \_IOR}, {\tt \_IOW} or {\tt \_IOWR} --- depending on the type) in a header file. This header file should then be {\tt \#include}'d both by the programs which will use {\tt ioctl} (so they can generate the appropriate {\tt ioctl}'s) and by the kernel module (so it can understand them). In the example below, the header file is {\tt chardev.h} and the program which uses it is {\tt ioctl.c}.
\index{\_IO} \index{\_IOR} \index{\_IOW} \index{\_IOWR}

If you want to use {\tt ioctl}'s in your own kernel modules, it is best to obtain an official {\tt ioctl} assignment, so if you accidentally get somebody else's {\tt ioctl}'s, or if they get yours, you'll know something is wrong. For more information, consult the kernel source tree at {\tt ``Documentation/ioctl-number.txt''}.
\index{official ioctl assignment} \index{ioctl\\official assignment}

sourcesample(chardev.c, 05_devrw)

sourcesample(chardev.h, 05_devrw)
\index{ioctl\\defining} \index{defining ioctls} \index{ioctl\\header file for} \index{header file for ioctls}

sourcesample(ioctl.c, 05_devrw)
\index{ioctl\\using in a process}

\chapter{Startup Parameters}\label{startup-param}
\index{startup parameters} \index{parameters\\startup}

In many of the previous examples, we had to hard-wire something into the kernel module, such as the file name for {\tt /proc} files or the major device number for the device so we can have {\tt ioctl}'s to it.
This goes against the grain of the Unix, and Linux, philosophy, which is to write flexible programs the user can customize.
\index{hard wiring}

The way to tell a program, or a kernel module, something it needs before it can start working is by command line parameters. In the case of kernel modules, we don't get {\tt argc} and {\tt argv} --- instead, we get something better. We can define global variables in the kernel module and {\tt insmod} will fill them for us.
\index{argc} \index{argv}

In this kernel module, we define two of them: {\tt str1} and {\tt str2}. All you need to do is compile the kernel module and then run {\tt insmod str1=xxx str2=yyy}. When {\tt init\_module} is called, {\tt str1} will point to the string {\tt ``xxx''} and {\tt str2} to the string {\tt ``yyy''}.
\index{insmod}

{\bf Warning:} There is no type checking on these arguments\footnote{There can't be, since under C the object file only has the location of global variables, not their type.}. If the first character of {\tt str1} or {\tt str2} is a digit the kernel will fill the variable with the value of the integer, rather than a pointer to the string. In a real life situation you have to check for this.

sourcesample(param.c, 06_params)

\chapter{System Calls}\label{sys-call}
\index{system calls} \index{calls\\system}

So far, the only thing we've done was to use well defined kernel mechanisms to register {\tt /proc} files and device handlers. This is fine if you want to do something the kernel programmers thought you'd want, such as write a device driver. But what if you want to do something unusual, to change the behavior of the system in some way? Then, you're mostly on your own. This is where kernel programming gets dangerous. While writing the example below, I killed the {\tt open} system call. This meant I couldn't open any files, I couldn't run any programs, and I couldn't {\tt shutdown} the computer. I had to pull the power switch. Luckily, no files died.
To ensure you won't lose any files either, please run {\tt sync} right before you do the {\tt insmod} and the {\tt rmmod}. \index{sync} \index{insmod} \index{rmmod} \index{shutdown} Forget about {\tt /proc} files, forget about device files. They're just minor details. The {\em real} process to kernel communication mechanism, the one used by all processes, is system calls. When a process requests a service from the kernel (such as opening a file, forking to a new process, or requesting more memory), this is the mechanism used. If you want to change the behaviour of the kernel in interesting ways, this is the place to do it. By the way, if you want to see which system calls a program uses, run {\tt strace }. \index{strace} In general, a process is not supposed to be able to access the kernel. It can't access kernel memory and it can't call kernel functions. The hardware of the CPU enforces this (that's the reason why it's called ``protected mode''). System calls are an exception to this general rule. What happens is that the process fills the registers with the appropriate values and then calls a special instruction which jumps to a previously defined location in the kernel (of course, that location is readable by user processes, it is not writable by them). Under Intel CPUs, this is done by means of interrupt 0x80. The hardware knows that once you jump to this location, you are no longer running in restricted user mode, but as the operating system kernel --- and therefore you're allowed to do whatever you want. \index{interrupt 0x80} The location in the kernel a process can jump to is called {\tt system\_call}. The procedure at that location checks the system call number, which tells the kernel what service the process requested. Then, it looks at the table of system calls ({\tt sys\_call\_table}) to see the address of the kernel function to call. 
Then it calls the function, and after it returns, does a few system checks and then returns to the process (or to a different process, if the process's time ran out). If you want to read this code, it's in the source file {\tt arch/$<$architecture$>$/kernel/entry.S}, after the line {\tt ENTRY(system\_call)}.
\index{system\_call} \index{ENTRY(system\_call)} \index{sys\_call\_table} \index{entry.S}

So, if we want to change the way a certain system call functions, what we need to do is to write our own function to implement it (usually by adding a bit of our own code, and then calling the original function) and then change the pointer at {\tt sys\_call\_table} to point to our function. Because we might be removed later and we don't want to leave the system in an unstable state, it's important for {\tt cleanup\_module} to restore the table to its original state. The source code here is an example of such a kernel module. We want to ``spy'' on a certain user, and to {\tt printk} a message whenever that user opens a file. Towards this end, we replace the system call to open a file with our own function, called {\tt our\_sys\_open}. This function checks the uid (user's id) of the current process, and if it's equal to the uid we spy on, it calls {\tt printk} to display the name of the file to be opened. Then, either way, it calls the original {\tt open} function with the same parameters, to actually open the file.
\index{open\\system call}

The {\tt init\_module} function replaces the appropriate location in {\tt sys\_call\_table} and keeps the original pointer in a variable. The {\tt cleanup\_module} function uses that variable to restore everything back to normal. This approach is dangerous, because of the possibility of two kernel modules changing the same system call. Imagine we have two kernel modules, A and B. A's open system call will be A\_open and B's will be B\_open.
Now, when A is inserted into the kernel, the system call is replaced with A\_open, which will call the original sys\_open when it's done. Next, B is inserted into the kernel, which replaces the system call with B\_open, which will call what it thinks is the original system call, A\_open, when it's done. Now, if B is removed first, everything will be well --- it will simply restore the system call to A\_open, which calls the original. However, if A is removed and then B is removed, the system will crash. A's removal will restore the system call to the original, sys\_open, cutting B out of the loop. Then, when B is removed, it will restore the system call to what {\bf it} thinks is the original, A\_open, which is no longer in memory. At first glance, it appears we could solve this particular problem by checking if the system call is equal to our open function and if so not changing it at all (so that B won't change the system call when it's removed), but that will cause an even worse problem. When A is removed, it sees that the system call was changed to B\_open so that it is no longer pointing to A\_open, so it won't restore it to sys\_open before it is removed from memory. Unfortunately, B\_open will still try to call A\_open which is no longer there, so that even without removing B the system would crash. The only way I can think of to prevent this problem is to restore the call to the original value, sys\_open. Unfortunately, sys\_open is not part of the kernel system table in {\tt /proc/ksyms}, so we can't access it. If anybody has a better idea, I'd be happy to hear it\footnote{Of course, the best idea is {\bf not} to change system calls. This sort of hacking should be considered a last ditch mechanism, used when there is no other way to accomplish something. I include this technique here because it's fun and flexible --- not because I recommend you actually use it in production environments.}. 
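
The unloading order problem is easier to see in a simulation. The sketch below is an ordinary user program, not a kernel module. The one-entry table, {\tt fake\_insmod}, {\tt fake\_rmmod} and the other names are all invented for this illustration, and unloading is modelled by nothing more than a flag.

\begin{verbatim}
#include <assert.h>
#include <stddef.h>

typedef int (*open_fn)(const char *path);

/* Stand-in for the original system call and for a one-entry table. */
int real_open(const char *path) { (void)path; return 0; }
open_fn table_entry = real_open;

/* A simulated module remembers the pointer it displaced. */
struct fake_module {
    open_fn hook;    /* the replacement function of this module   */
    open_fn saved;   /* whatever was in the table at load time    */
    int loaded;      /* nonzero while the module is "in memory"   */
};

int a_open(const char *path) { (void)path; return 1; }
int b_open(const char *path) { (void)path; return 2; }

struct fake_module mod_a = { a_open, NULL, 0 };
struct fake_module mod_b = { b_open, NULL, 0 };

/* What init_module does in the text: save the old pointer, install ours. */
void fake_insmod(struct fake_module *m)
{
    assert(m != NULL && !m->loaded);
    m->saved = table_entry;
    table_entry = m->hook;
    m->loaded = 1;
}

/* What cleanup_module does in the text: put the saved pointer back. */
void fake_rmmod(struct fake_module *m)
{
    assert(m != NULL && m->loaded);
    table_entry = m->saved;
    m->loaded = 0;
}

/* Nonzero if the table points at a function whose module is gone. */
int table_is_dangling(void)
{
    return (table_entry == a_open && !mod_a.loaded) ||
           (table_entry == b_open && !mod_b.loaded);
}
\end{verbatim}

Inserting A and then B and removing them in the order B, A leaves {\tt table\_entry} pointing at {\tt real\_open}. Removing A first and then B leaves it pointing at {\tt a\_open} although A is marked as unloaded, so {\tt table\_is\_dangling} returns nonzero. In a real kernel that pointer would lead into memory which no longer holds the function.
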
sourcesample(syscall.c, 07_syscall)

\chapter{Blocking Processes}\label{blocks}
\index{blocking processes}
\index{processes\\blocking}

What do you do when somebody asks you for something you can't do right away? If you're a human being and you're bothered by a human being, the only thing you can say is: ``Not right now, I'm busy. {\em Go away!}''. But if you're a kernel module and you're bothered by a process, you have another possibility. You can put the process to sleep until you can service it. After all, processes are being put to sleep by the kernel and woken up all the time (that's the way multiple processes appear to run at the same time on a single CPU).
\index{multi tasking}
\index{busy}

This kernel module is an example of this. The file (called {\tt /proc/sleep}) can only be opened by a single process at a time. If the file is already open, the kernel module calls {\tt module\_interruptible\_sleep\_on}\footnote{The easiest way to keep a file open is to open it with {\tt tail -f}.}. This function changes the status of the task (a task is the kernel data structure which holds information about a process and the system call it's in, if any) to {\tt TASK\_INTERRUPTIBLE}, which means that the task will not run until it is woken up somehow, and adds it to {\tt WaitQ}, the queue of tasks waiting to access the file. Then, the function calls the scheduler to context switch to a different process, one which has some use for the CPU.
\index{module\_interruptible\_sleep\_on}
\index{interruptible\_sleep\_on}
\index{TASK\_INTERRUPTIBLE}
\index{sleep\\putting processes to}
\index{processes\\putting to sleep}
\index{putting processes to sleep}
\index{task structure}
\index{structure\\task}

When a process is done with the file, it closes it, and {\tt module\_close} is called. That function wakes up all the processes in the queue (there's no mechanism to only wake up one of them). It then returns and the process which just closed the file can continue to run.
In time, the scheduler decides that that process has had enough and gives control of the CPU to another process. Eventually, one of the processes which was in the queue will be given control of the CPU by the scheduler. It starts at the point right after the call to {\tt module\_interruptible\_sleep\_on}\footnote{This means that the process is still in kernel mode: as far as the process is concerned, it issued the {\tt open} system call and the system call hasn't returned yet. The process doesn't know somebody else used the CPU for most of the time between the moment it issued the call and the moment it returned.}. It can then proceed to set a global variable to tell all the other processes that the file is still open and go on with its life. When the other processes get a piece of the CPU, they'll see that global variable and go back to sleep.
\index{waking up processes}
\index{processes\\waking up}
\index{multitasking}
\index{scheduler}

To make our life more interesting, {\tt module\_close} doesn't have a monopoly on waking up the processes which wait to access the file. A signal, such as Ctrl-C ({\tt SIGINT}), can also wake up a process\footnote{This is because we used {\tt module\_interruptible\_sleep\_on}. We could have used {\tt module\_sleep\_on} instead, but that would have resulted in extremely angry users whose Ctrl-C's are ignored.}. In that case, we want to return with {\tt -EINTR} immediately. This is important so users can, for example, kill the process before it receives the file.
\index{module\_wake\_up}
\index{signal}
\index{SIGINT}
\index{ctrl-c}
\index{EINTR}
\index{processes\\killing}
\index{module\_sleep\_on}
\index{sleep\_on}

There is one more point to remember. Sometimes processes don't want to sleep; they want either to get what they want immediately, or to be told it cannot be done. Such processes use the {\tt O\_NONBLOCK} flag when opening the file.
The kernel is supposed to respond by returning with the error code {\tt -EAGAIN} from operations which would otherwise block, such as opening the file in this example. The program {\tt cat\_noblock}, available in the source directory for this chapter, can be used to open a file with {\tt O\_NONBLOCK}.
\index{O\_NONBLOCK}
\index{non blocking}
\index{blocking, how to avoid}
\index{EAGAIN}

sourcesample(sleep.c, 08_sleep)

\chapter{Replacing printk's}\label{printk}
\index{printk\\replacing}
\index{replacing printk's}

In the beginning (chapter \ref{hello-world}), I said that X and kernel module programming don't mix. That's true while developing the kernel module, but in actual use you want to be able to send messages to whatever tty\footnote{{\bf T}ele{\bf ty}pe, originally a combination keyboard--printer used to communicate with a Unix system, and today an abstraction for the text stream used for a Unix program, whether it's a physical terminal, an xterm on an X display, a network connection used with telnet, etc.} the command to the module came from. This is important for identifying errors after the kernel module is released, because it will be used through all of those.

The way this is done is by using {\tt current}, a pointer to the currently running task, to get the current task's tty structure. Then, we look inside that tty structure to find a pointer to a string write function, which we use to write a string to the tty.
\index{current task}\index{task\\current}
\index{tty\_struct}\index{struct\\tty}

sourcesample(printk.c, 09_printk)

\chapter{Scheduling Tasks}\label{sched}
\index{scheduling tasks}\index{tasks\\scheduling}

Very often, we have ``housekeeping'' tasks which have to be done at a certain time, or every so often. If the task is to be done by a process, we do it by putting it in the crontab. If the task is to be done by a kernel module, we have two possibilities.
The first is to put a process in crontab which will wake up the module by a system call when necessary, for example by opening a file. This is terribly inefficient, however: we run a new process off of crontab, read a new executable into memory, and all this just to wake up a kernel module which is in memory anyway.
\index{house keeping}\index{crontab}

Instead of doing that, we can create a function that will be called once for every timer interrupt. To do this, we create a task, held in a {\tt struct tq\_struct}, which will hold a pointer to the function. Then, we use {\tt queue\_task} to put that task on a task list called {\tt tq\_timer}, which is the list of tasks to be executed on the next timer interrupt. Because we want the function to keep on being executed, we need to put it back on {\tt tq\_timer} whenever it is called, for the next timer interrupt.
\index{struct tq\_struct}\index{tq\_struct struct}
\index{queue\_task}
\index{task}
\index{tq\_timer}

There's one more point we need to remember here. When a module is removed by {\tt rmmod}, first its reference count is checked. If it is zero, {\tt cleanup\_module} is called. Then, the module is removed from memory with all its functions. Nobody checks to see if the timer's task list happens to contain a pointer to one of those functions, which will no longer be available. Ages later (from the computer's perspective; from a human perspective it's nothing, less than $\frac{1}{100}$ of a second), the kernel has a timer interrupt and tries to call the function on the task list. Unfortunately, the function is no longer there. In most cases, the memory page where it sat is unused, and you get an ugly error message. But if some other code is now sitting at the same memory location, it could get {\bf very} ugly. Unfortunately, we don't have an easy way to unregister a task from a task list.
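The re-queueing idea can be tried out in user space. The following is a small simulation in plain C, not kernel code: {\tt timer\_queue} stands in for {\tt tq\_timer}, and {\tt run\_timer\_tick} plays the part of the timer interrupt. All of the names here ({\tt struct task}, {\tt queue\_task\_sim}, {\tt run\_timer\_tick}, {\tt count\_ticks}, {\tt start\_counting}, {\tt stop\_requested}) are hypothetical, and the {\tt queued} flag is a simplifying assumption used to prevent a task from being put on the list twice.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a task queue entry (an assumption of this sketch). */
struct task {
    void (*routine)(void *);   /* function to call on the next tick */
    void *data;                /* argument handed to the function */
    struct task *next;         /* next task on the list */
    int queued;                /* nonzero while the task is on the list */
};

/* Stand-in for tq_timer: tasks to run on the next simulated timer tick. */
static struct task *timer_queue = NULL;

/* Put a task on the list unless it is already there. */
static void queue_task_sim(struct task *t)
{
    assert(t->routine != NULL);
    if (t->queued)
        return;
    t->queued = 1;
    t->next = timer_queue;
    timer_queue = t;
}

/* One simulated timer interrupt: detach the list, then run every task once. */
static void run_timer_tick(void)
{
    struct task *list = timer_queue;

    timer_queue = NULL;        /* detach first so routines may re-queue */
    while (list != NULL) {
        struct task *t = list;

        list = t->next;
        t->next = NULL;
        t->queued = 0;
        t->routine(t->data);
    }
}

static int tick_count = 0;      /* how many times the routine has run */
static int stop_requested = 0;  /* set to nonzero to stop re-queueing */
static struct task my_task;

/* The periodic routine: do the work, then ask to be called again. */
static void count_ticks(void *unused)
{
    (void)unused;
    tick_count++;
    if (!stop_requested)
        queue_task_sim(&my_task);
}

/* Arm the periodic routine for the first time. */
static void start_counting(void)
{
    my_task.routine = count_ticks;
    my_task.data = NULL;
    queue_task_sim(&my_task);
}
```

After {\tt start\_counting()}, every call to {\tt run\_timer\_tick()} runs {\tt count\_ticks} once and leaves {\tt my\_task} on the list again. Once {\tt stop\_requested} is set, the routine runs one last time and the list stays empty. While the task is still on the list, the list holds a pointer to {\tt count\_ticks}, which is exactly the pointer that must not outlive the code it points to.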
\index{rmmod}
\index{reference count}
\index{cleanup\_module}

Since {\tt cleanup\_module} can't return with an error code (it's a void function), the solution is to not let it return at all. Instead, it calls {\tt sleep\_on} or {\tt module\_sleep\_on}\footnote{They're really the same.} to put the {\tt rmmod} process to sleep. Before that, it sets a global variable to tell the function called on the timer interrupt to stop re-attaching itself. Then, on the next timer interrupt, the {\tt rmmod} process will be woken up, when our function is no longer in the queue and it's safe to remove the module.
\index{sleep\_on}\index{module\_sleep\_on}

sourcesample(sched.c, 10_sched)

\chapter{Interrupt Handlers}\label{int-handler}
\index{interrupt handlers}\index{handlers\\interrupt}

Except for the last chapter, everything we did in the kernel so far we've done as a response to a process asking for it, either by dealing with a special file, sending an {\tt ioctl}, or issuing a system call. But the job of the kernel isn't just to respond to process requests. Another job, which is every bit as important, is to speak to the hardware connected to the machine.

There are two types of interaction between the CPU and the rest of the computer's hardware. The first type is when the CPU gives orders to the hardware, the other is when the hardware needs to tell the CPU something. The second, called interrupts, is much harder to implement because it has to be dealt with when convenient for the hardware, not the CPU. Hardware devices typically have a very small amount of RAM, and if you don't read their information when available, it is lost.

Under Linux, hardware interrupts are called IRQs (short for {\bf I}nterrupt {\bf R}e{\bf q}uests)\footnote{This is standard nomenclature on the Intel architecture where Linux originated.}. There are two types of IRQs, short and long.
A short IRQ is one which is expected to take a {\bf very} short period of time, during which the rest of the machine will be blocked and no other interrupts will be handled. A long IRQ is one which can take longer, and during which other interrupts may occur (but not interrupts from the same device). If at all possible, it's better to declare an interrupt handler to be long.

When the CPU receives an interrupt, it stops whatever it's doing (unless it's processing a more important interrupt, in which case it will deal with this one only when the more important one is done), saves certain parameters on the stack and calls the interrupt handler. This means that certain things are not allowed in the interrupt handler itself, because the system is in an unknown state. The solution to this problem is for the interrupt handler to do what needs to be done immediately, usually read something from the hardware or send something to the hardware, and then schedule the handling of the new information at a later time (this is called the ``bottom half'') and return. The kernel is then guaranteed to call the bottom half as soon as possible, and when it does, everything allowed in kernel modules will be allowed.
\index{bottom half}

The way to implement this is to call {\tt request\_irq} to get your interrupt handler called when the relevant IRQ is received (there are 16 of them on Intel platforms). This function receives the IRQ number, the name of the function, flags, a name for {\tt /proc/interrupts} and a parameter to pass to the interrupt handler. The flags can include {\tt SA\_SHIRQ} to indicate you're willing to share the IRQ with other interrupt handlers (usually because a number of hardware devices sit on the same IRQ) and {\tt SA\_INTERRUPT} to indicate this is a fast (that is, short) interrupt. This function will only succeed if there isn't already a handler on this IRQ, or if you're both willing to share.
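The sharing rule in the last sentence can be illustrated with a small user-space simulation in plain C. This is a sketch of the rule only, not the kernel's {\tt request\_irq}: the names {\tt sim\_request\_irq}, {\tt sim\_raise\_irq}, {\tt count\_hit}, {\tt SIM\_SHARE}, {\tt SIM\_NR\_IRQS} and {\tt SIM\_MAX\_SHARED} are hypothetical, {\tt SIM\_SHARE} merely plays the role of {\tt SA\_SHIRQ}, and the choice of {\tt -EBUSY} and {\tt -EINVAL} as error values is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SIM_NR_IRQS     16    /* number of simulated IRQ lines */
#define SIM_MAX_SHARED  4     /* handlers allowed per line in this sketch */
#define SIM_SHARE       0x1   /* hypothetical stand-in for SA_SHIRQ */

typedef void (*irq_handler)(int irq, void *dev_id);

/* Everything the simulation remembers about one IRQ line. */
struct irq_line {
    irq_handler handlers[SIM_MAX_SHARED];
    void *dev_ids[SIM_MAX_SHARED];
    int count;                /* how many handlers are registered */
    int shared;               /* nonzero if the first handler agreed to share */
};

static struct irq_line lines[SIM_NR_IRQS];

/* Register a handler. Succeeds only if the line is free, or if both the
 * existing handlers and the new one are willing to share. */
static int sim_request_irq(int irq, irq_handler handler,
                           unsigned int flags, void *dev_id)
{
    struct irq_line *l;

    if (irq < 0 || irq >= SIM_NR_IRQS || handler == NULL)
        return -EINVAL;
    l = &lines[irq];
    if (l->count > 0) {
        if (!l->shared || !(flags & SIM_SHARE))
            return -EBUSY;    /* somebody is not willing to share */
        if (l->count == SIM_MAX_SHARED)
            return -EBUSY;    /* no room left in this sketch */
    } else {
        l->shared = (flags & SIM_SHARE) != 0;
    }
    l->handlers[l->count] = handler;
    l->dev_ids[l->count] = dev_id;
    l->count++;
    return 0;
}

/* Simulate the interrupt: call every handler on the line, return how many. */
static int sim_raise_irq(int irq)
{
    int i;

    assert(irq >= 0 && irq < SIM_NR_IRQS);
    for (i = 0; i < lines[irq].count; i++)
        lines[irq].handlers[i](irq, lines[irq].dev_ids[i]);
    return lines[irq].count;
}

/* Example handler: dev_id points at an int counter owned by the caller. */
static void count_hit(int irq, void *dev_id)
{
    (void)irq;
    (*(int *)dev_id)++;
}
```

Two handlers that both pass {\tt SIM\_SHARE} can sit on the same simulated line, and raising that line calls both of them. A third request without the flag is refused, and so is a request with the flag on a line whose first handler did not ask to share.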
\index{request\_irq}
\index{/proc/interrupts}
\index{SA\_SHIRQ}
\index{SA\_INTERRUPT}

Then, from within the interrupt handler, we communicate with the hardware and then use {\tt queue\_task\_irq} with {\tt tq\_immediate} and {\tt mark\_bh(BH\_IMMEDIATE)} to schedule the bottom half. The reason we can't use the standard {\tt queue\_task} is that the interrupt might happen right in the middle of somebody else's {\tt queue\_task} ({\tt queue\_task\_irq} is protected from this by a global lock). We need {\tt mark\_bh} because earlier versions of Linux only had an array of 32 bottom halves, and now one of them ({\tt BH\_IMMEDIATE}) is used for the linked list of bottom halves for drivers which didn't get a bottom half entry assigned to them.
\index{queue\_task\_irq}
\index{tq\_immediate}
\index{mark\_bh}
\index{BH\_IMMEDIATE}

\section{Keyboards on the Intel Architecture}\label{keyboard}
\index{keyboard}\index{intel architecture\\keyboard}

{\bf Warning: The rest of this chapter is completely Intel specific. If you're not running on an Intel platform, it will not work. Don't even try to compile the code here.}

I had a problem with writing the sample code for this chapter. On one hand, for an example to be useful it has to run on everybody's computer with meaningful results. On the other hand, the kernel already includes device drivers for all of the common devices, and those device drivers won't coexist with what I'm going to write. The solution I found was to write something for the keyboard interrupt, and disable the regular keyboard interrupt handler first. Since it is defined as a static symbol in the kernel source files (specifically, {\tt drivers/char/keyboard.c}), there is no way to restore it. Before insmod'ing this code, run {\tt sleep 120 ; reboot} on another terminal if you value your file system.

This code binds itself to IRQ 1, which is the IRQ of the keyboard controller under Intel architectures.
Then, when it receives a keyboard interrupt, it reads the keyboard's status (that's the purpose of the {\tt inb(0x64)}) and the scan code, which is the value returned by the keyboard. Then, as soon as the kernel thinks it's feasible, it runs {\tt got\_char} which gives the code of the key used (the first seven bits of the scan code) and whether it has been pressed (if the 8th bit is zero) or released (if it's one).
\index{inb}

sourcesample(intrpt.c, 11_intrp)

% \chapter{Symmetrical Multi--Processing}\label{smp}
% \index{SMP}
% \index{multi-processing}
% \index{symmetrical multi--processing}
% \index{processing\\multi}
%
% One of the easiest (read, cheapest) ways to improve hardware performance is
% to put more than one CPU on the board. This can be done either making the
% different CPUs take on different jobs (asymmetrical multi--processing) or by
% making them all run in parallel, doing the same job (symmetrical
% multi--processing). Doing asymmetrical multi--processing effectively
% requires specialized knowledge about the tasks the computer should do, which
% is unavailable in a general purpose operating system such as Linux. On the
% other hand, symmetrical multi--processing is relatively easy to implement.
% \index{CPU\\multiple}
%
% By relatively easy, I mean exactly that --- not that it's {\em really}
% easy. In a symmetrical multi--processing environment, the CPUs share the
% same memory, and as a result code running in one CPU can affect the memory
% used by another. You can no longer be certain that a variable you've set to
% a certain value in the previous line still has that value --- the other
% CPU might have played with it while you weren't looking. Obviously, it's
% impossible to program like this.
%
% In practice, the problem is not that bad, because each process only runs on
% a single CPU and processes can't access each other's memory anyway.
% However, each process is allowed to
%
% Find the way to lock the critical section of the code if necessary. Not
% necessary right now.

\appendix

\chapter{Where From Here?}\label{where-to}

I could easily have squeezed a few more chapters into this book. I could have added a chapter about creating new file systems, or about adding new protocol stacks (as if there's a need for that; you'd have to dig underground to find a protocol stack not supported by Linux). I could have added explanations of the kernel mechanisms we haven't touched upon, such as bootstrapping or the disk interface.

However, I chose not to. My purpose in writing this book was to provide initiation into the mysteries of kernel module programming and to teach the common techniques for that purpose. For people seriously interested in kernel programming, there are already two books in the Linux Documentation Project which explain how things are done in the kernel. If you prefer a paper version, there are a number of books out there. And, as Linus said, the best way to learn the kernel is to read the source code yourself.

If you're interested in more examples of short kernel modules, I recommend Phrack magazine. Even if you're not interested in security, and as a programmer you should be, the kernel modules there are good examples of what you can do inside the kernel, and they're short enough not to require too much effort to understand.

I hope I have helped you in your quest to become a better programmer, or at least to have fun through technology. And, if you do write useful kernel modules, I hope you publish them under the GPL, so I can use them too.

include(thankme.m4)

include(gpl.m4)

\addcontentsline{toc}{chapter}{Index}
\input{mpg.ind}

\end{document}