How Unix pipelines are implemented


This article describes the implementation of pipes in the Unix kernel. I was somewhat disappointed that a recent article titled "How do pipelines work in Unix?" turned out not to be about their internal workings. That made me curious, so I dug into old kernel sources to find the answer.

What are we talking about?


Pipes, "probably the most important invention in Unix", are a defining characteristic of the underlying Unix philosophy of combining small programs, as well as of the familiar command line:

$ echo hello | wc -c
6

This functionality depends on the pipe system call provided by the kernel, which is described in the pipe(7) and pipe(2) documentation pages:

Pipes provide a unidirectional interprocess communication channel. A pipe has a write end (its input) and a read end (its output). Data written to the write end of the pipe can be read from the read end.

A pipe is created with a call to pipe(2), which returns two file descriptors: one refers to the read end of the pipe, the other to the write end.

The trace of the above command demonstrates the creation of a pipe and the flow of data through it from one process to another:

$ strace -qf -e execve,pipe,dup2,read,write \
    sh -c 'echo hello | wc -c'

execve("/bin/sh", ["sh", "-c", "echo hello | wc -c"], …)
pipe([3, 4])                            = 0
[pid 2604795] dup2(4, 1)                = 1
[pid 2604795] write(1, "hello\n", 6)    = 6
[pid 2604796] dup2(3, 0)                = 0
[pid 2604796] execve("/usr/bin/wc", ["wc", "-c"], …)
[pid 2604796] read(0, "hello\n", 16384) = 6
[pid 2604796] write(1, "6\n", 2)        = 2

The parent process calls pipe() to get the connected file descriptors. One child process writes into one descriptor, and the other child reads the same data from the other descriptor. The shell uses dup2 to "rename" descriptors 4 and 3 to stdout and stdin respectively.

Without pipes, the shell would have to write the output of one process to a file and then hand that file to the other process so it could read the data back. We would pay for this with extra disk space and I/O. But pipes are good for more than just avoiding temporary files:

If a process attempts to read from an empty pipe, then read(2) will block until data is available. If a process attempts to write to a full pipe, then write(2) blocks until sufficient data has been read from the pipe to allow the write to complete.

Together with the POSIX requirement that writes of up to PIPE_BUF bytes (at least 512) must be atomic, this is an important property: it lets processes communicate with each other through a pipe in a way that regular files (which give no such guarantees) cannot support.

With a regular file, a process could write all of its output to it and then pass the file to another process. Or the processes could work in parallel in lock-step, using an external signaling mechanism (such as a semaphore) to inform each other when a write or a read has completed. Pipes save us all this trouble.

What are we looking for?


Let me sketch it out roughly, to make it easier to imagine how a pipe might work. You would need to allocate a buffer and some state in memory. You would need functions to add data to and remove data from the buffer. You would need some way of invoking those functions during read and write operations on file descriptors. And you would need locks to implement the special behavior described above.

Now we are ready to interrogate the kernel source code under bright lamps, to confirm or refute this vague mental model. But always be prepared for surprises.

Where are we looking?


I don't know where my copy of the famous Lions book with the Unix 6th Edition source code has gone, but thanks to The Unix Heritage Society you can read even older versions of Unix in source form online.

Wandering around the TUHS archives is like visiting a museum. We get to look at our shared history, and I respect the years of effort spent recovering these materials bit by bit from old tapes and printouts. And I am acutely aware of the fragments that are still missing.

Having satisfied our curiosity about the ancient history of pipes, we can look at modern kernels for comparison.

By the way, pipe is system call number 42 in the sysent[] table. Coincidence?

Traditional Unix kernels (1970–1974)


I found no trace of pipe(2) in PDP-7 Unix (January 1970), in the first edition of Unix (November 1971), or in the incomplete source code of the second edition (June 1972).

TUHS states that the third edition of Unix (February 1973) was the first version with pipes:

The third edition of Unix was the last version with a kernel written in assembly language, but also the first version with pipes. During 1973, work was done to improve the third edition, the kernel was rewritten in C, and thus the fourth edition of Unix appeared.

One reader found a scan of a document in which Doug McIlroy proposed the idea of "connecting programs like a garden hose".


Brian Kernighan's book "Unix: A History and a Memoir" also mentions this document in its account of how pipes came to be: "...it hung on the wall in my office at Bell Labs for 30 years." There is also an interview with McIlroy, and a story McIlroy himself wrote down in 2014. In it he recalls how, as Unix was taking shape, he kept pestering Ken Thompson with the idea of connecting the output of one program to the input of another. For a long time Thompson, a minimalist, was not convinced that piping directly between processes offered a real advantage over writing to a file that another process would later read. Only when McIlroy came up with a concrete proposal and a catchy shell syntax did Thompson finally exclaim: "I'll do it!"

As the story goes, Thompson added pipes to the kernel in a single evening, and then reworked most of the system's utilities so that they read from standard input and could therefore take part in pipelines. That API has stayed with us ever since.

Unfortunately, the kernel source code for the third edition of Unix is lost. And although we have the C source of the fourth edition kernel, released in November 1973, it corresponds to a state a few months before the official release and does not contain the pipe implementation. It is a pity that the source of this legendary Unix feature is lost, possibly forever.

We do have the pipe(2) documentation from both releases, so we can start with the third edition man page (in which certain words are underlined "by hand", as literal ^H backspace characters followed by underscores!). This proto-pipe(2), written in assembler, returns only one file descriptor, but already provides the expected basic functionality:

The pipe system call creates an input-output mechanism called a pipe. The file descriptor returned can be used for both read and write operations. When the pipe is written, up to 504 bytes of data are buffered, after which the writing process is suspended. Reading from the pipe picks up the buffered data.

Within a year the kernel was rewritten in C, and pipe(2) in the fourth edition acquired its modern look with the prototype "pipe(fildes)":

Pipe creates an I/O mechanism called a pipe. The file descriptors returned can be used in read and write operations. When the pipe is written using the descriptor returned in r1 (resp. fildes[1]), up to 4096 bytes of data are buffered before the writing process is suspended. A read using the descriptor returned in r0 (resp. fildes[0]) will pick up the data.

It is assumed that after the pipe has been set up, two (or more) cooperating processes (created by subsequent fork calls) will pass data through the pipe with read and write calls.

The shell has a syntax to define a linear array of processes connected by pipes.

Read calls on an empty pipe (one containing no buffered data) that has only one end (all write file descriptors closed) return "end of file". Write calls in the analogous situation are ignored.

The earliest surviving pipe implementation dates to the fifth edition of Unix (June 1974), but it is almost identical to the one in the following release (only comments were added), so the fifth edition can be skipped.

Sixth Edition of Unix (1975)


Let us begin reading the sixth edition Unix kernel source (May 1975). Largely thanks to Lions, it is much easier to find than the sources of earlier versions:

For many years, the Lions book was the only Unix kernel documentation available outside the walls of Bell Labs. Although the sixth edition license allowed teachers to use its source code, the seventh edition license excluded that possibility, so the book circulated as illegal typewritten copies.

Today you can buy a reprint of the book, whose cover shows students at a photocopier. And thanks to Warren Toomey (who started the TUHS project), you can download a PDF of the sixth edition source code. I want to give you an idea of how much effort went into producing that file:

Toomey recalls that more than fifteen years earlier he had typed in the source code from the Lions book, at a time when TUHS did not yet exist and the original files were not available. Then, in 1988, he found an old 9-track tape containing a backup of a PDP-11. It held an intact /usr/src/ tree, most of it timestamped 1979, which even then seemed ancient; it appeared to be a seventh edition, or a PWB derivative.

Taking that tree as a base, he hand-edited the sources back into their sixth edition form: some code stayed the same, some needed small changes (such as turning the modern token += back into the archaic =+), and some had to be removed or rewritten, though not too much.

And today we can read the sixth edition source code online at TUHS, in an archive that Dennis Ritchie himself had a hand in.

By the way, at first glance the most striking feature of this C code from before the Kernighan and Ritchie era is its brevity. It is not often that I can embed code snippets without extensive editing to fit the relatively narrow display area of my site.

At the beginning of /usr/sys/ken/pipe.c there is an explanatory comment (and yes, there is also a /usr/sys/dmr):

/*
 * Max allowable buffering per pipe.
 * This is also the max size of the
 * file created to implement the pipe.
 * If this size is bigger than 4096,
 * pipes will be implemented in LARG
 * files, which is probably not good.
 */
#define    PIPSIZ    4096

The buffer size has not changed since the fourth edition. But here we see, without any public documentation, that pipes once used files as backing storage!

As for LARG files, they correspond to the inode flag LARG, which is used by the "large addressing algorithm" to handle indirect blocks in order to support larger files. Since Ken said it's better not to use them, I will gladly take his word for it.

Here is the actual pipe system call:

/*
 * The sys-pipe entry.
 * Allocate an inode on the root device.
 * Allocate 2 file structures.
 * Put it all together with flags.
 */
pipe()
{
    register *ip, *rf, *wf;
    int r;

    ip = ialloc(rootdev);
    if(ip == NULL)
        return;
    rf = falloc();
    if(rf == NULL) {
        iput(ip);
        return;
    }
    r = u.u_ar0[R0];
    wf = falloc();
    if(wf == NULL) {
        rf->f_count = 0;
        u.u_ofile[r] = NULL;
        iput(ip);
        return;
    }
    u.u_ar0[R1] = u.u_ar0[R0]; /* wf's fd */
    u.u_ar0[R0] = r;           /* rf's fd */
    wf->f_flag = FWRITE|FPIPE;
    wf->f_inode = ip;
    rf->f_flag = FREAD|FPIPE;
    rf->f_inode = ip;
    ip->i_count = 2;
    ip->i_flag = IACC|IUPD;
    ip->i_mode = IALLOC;
}

The comment clearly describes what is happening here. But the code is not so easy to follow, partly because of the way "struct user u" and the registers R0 and R1 are used to pass system call parameters and return values.

We use ialloc() to allocate an inode on the root device, and falloc() to allocate two file structures in memory. If all goes well, we set flags marking these files as the two ends of a pipe, point them both at the same inode (whose reference count becomes 2), and mark the inode as modified and in use. Note the calls to iput() on the error paths, which decrement the reference count of the new inode.

pipe() must return the read and write file descriptor numbers through R0 and R1. falloc() returns a pointer to a file structure, but it also "returns" a file descriptor through u.u_ar0[R0]. So the code saves the read file descriptor into r, and assigns the write descriptor directly from u.u_ar0[R0] after the second call to falloc().

The FPIPE flag that we set when creating the pipe controls the behavior of the rdwr() function in sys2.c, which dispatches to the specific I/O routines:

/*
 * common code for read and write calls:
 * check permissions, set base, count, and offset,
 * and switch out to readi, writei, or pipe code.
 */
rdwr(mode)
{
    register *fp, m;

    m = mode;
    fp = getf(u.u_ar0[R0]);
        /* … */

    if(fp->f_flag&FPIPE) {
        if(m==FREAD)
            readp(fp); else
            writep(fp);
    }
        /* … */
}

The readp() function in pipe.c then reads data from the pipe, but it is better to trace the implementation starting with writep(). Once again, the code is complicated by the argument-passing convention, but some details can be omitted.

writep(fp)
{
    register *rp, *ip, c;

    rp = fp;
    ip = rp->f_inode;
    c = u.u_count;

loop:
    /* If all done, return. */

    plock(ip);
    if(c == 0) {
        prele(ip);
        u.u_count = 0;
        return;
    }

    /*
     * If there are not both read and write sides of the
     * pipe active, return error and signal too.
     */

    if(ip->i_count < 2) {
        prele(ip);
        u.u_error = EPIPE;
        psignal(u.u_procp, SIGPIPE);
        return;
    }

    /*
     * If the pipe is full, wait for reads to deplete
     * and truncate it.
     */

    if(ip->i_size1 == PIPSIZ) {
        ip->i_mode =| IWRITE;
        prele(ip);
        sleep(ip+1, PPIPE);
        goto loop;
    }

    /* Write what is possible and loop back. */

    u.u_offset[0] = 0;
    u.u_offset[1] = ip->i_size1;
    u.u_count = min(c, PIPSIZ-u.u_offset[1]);
    c =- u.u_count;
    writei(ip);
    prele(ip);
    if(ip->i_mode&IREAD) {
        ip->i_mode =& ~IREAD;
        wakeup(ip+2);
    }
    goto loop;
}

We want to write u.u_count bytes to the pipe. First we must lock the inode (see plock/prele below).

Then we check the inode reference count. As long as both ends of the pipe are open, the count should be 2. We hold one reference (via rp->f_inode), so a count below 2 must mean that the reading process has closed its end of the pipe. In other words, we are trying to write to a closed pipe, which is an error. The EPIPE error code and the SIGPIPE signal first appeared in the sixth edition of Unix.

But even if the pipe is open, it may be full. In that case we release the lock and go to sleep, hoping that some other process will read from the pipe and free up enough space. On waking, we jump back to the top, take the lock again, and start a new write cycle.

If there is free space in the pipe, we write data to it with writei(). The inode's i_size1 field (which is 0 for an empty pipe) marks the end of the data the pipe already holds. If there is room, we can fill the pipe from i_size1 up to PIPSIZ. Then we release the lock and try to wake any process waiting to read from the pipe. Finally, we jump back to the top to check whether we managed to write as many bytes as requested; if not, we begin another write cycle.

Normally the inode's i_mode field is used to store the r, w and x permission bits. But for pipes, we signal that some process is waiting to write or read using the IWRITE and IREAD bits respectively. A process sets the flag and calls sleep(), expecting that some other process will call wakeup() in the future.

The real magic happens in sleep() and wakeup(). They are implemented in slp.c, home of the famous comment "You are not expected to understand this." Fortunately, we are not required to understand the code; a few comments are enough:

/*
 * Give up the processor till a wakeup occurs
 * on chan, at which time the process
 * enters the scheduling queue at priority pri.
 * The most important effect of pri is that when
 * pri<0 a signal cannot disturb the sleep;
 * if pri>=0 signals will be processed.
 * Callers of this routine must be prepared for
 * premature return, and check that the reason for
 * sleeping has gone away.
 */
sleep(chan, pri) /* … */

/*
 * Wake up all processes sleeping on chan.
 */
wakeup(chan) /* … */

A process that calls sleep() on a particular channel can later be woken by another process calling wakeup() on the same channel. writep() and readp() coordinate their actions through such paired calls. Note that pipe.c always passes the priority PPIPE to sleep(), so every sleep() here can be interrupted by a signal.

Now we have everything we need to understand readp():

readp(fp)
int *fp;
{
    register *rp, *ip;

    rp = fp;
    ip = rp->f_inode;

loop:
    /* Very conservative locking. */

    plock(ip);

    /*
     * If the head (read) has caught up with
     * the tail (write), reset both to 0.
     */

    if(rp->f_offset[1] == ip->i_size1) {
        if(rp->f_offset[1] != 0) {
            rp->f_offset[1] = 0;
            ip->i_size1 = 0;
            if(ip->i_mode&IWRITE) {
                ip->i_mode =& ~IWRITE;
                wakeup(ip+1);
            }
        }

        /*
         * If there are not both reader and
         * writer active, return without
         * satisfying read.
         */

        prele(ip);
        if(ip->i_count < 2)
            return;
        ip->i_mode =| IREAD;
        sleep(ip+2, PPIPE);
        goto loop;
    }

    /* Read and return */

    u.u_offset[0] = 0;
    u.u_offset[1] = rp->f_offset[1];
    readi(ip);
    rp->f_offset[1] = u.u_offset[1];
    prele(ip);
}

You may find it easier to read this function from the bottom up. The "read and return" branch is the common case, when the pipe contains some data. We use readi() to read as much data as is available, starting from the current read f_offset, and then update the offset accordingly.

The pipe is empty when the read offset has caught up with i_size1 in the inode. We then reset both positions to 0 and try to wake any process that wants to write to the pipe. We know that writep() sleeps on ip+1 when the pipe is full; now that the pipe is empty, we can wake it so that it resumes its write cycle.

If there is nothing to read, readp() sets the IREAD flag and goes to sleep on ip+2. We know that writep() will wake it once it has written some data to the pipe.

The comments on readi() and writei() help us see that, setting aside the parameters passed through "u", we can treat them as ordinary I/O functions that take a file, a position, an in-memory buffer, and a count of bytes to read or write.

/*
 * Read the file corresponding to
 * the inode pointed at by the argument.
 * The actual read arguments are found
 * in the variables:
 *    u_base        core address for destination
 *    u_offset    byte offset in file
 *    u_count        number of bytes to read
 *    u_segflg    read to kernel/user
 */
readi(aip)
struct inode *aip;
/* … */

/*
 * Write the file corresponding to
 * the inode pointed at by the argument.
 * The actual write arguments are found
 * in the variables:
 *    u_base        core address for source
 *    u_offset    byte offset in file
 *    u_count        number of bytes to write
 *    u_segflg    write to kernel/user
 */
writei(aip)
struct inode *aip;
/* … */

As for the "very conservative locking": readp() and writep() keep the inode locked until they finish their work or get a result (i.e. call wakeup). plock() and prele() work simply, using a separate pair of sleep and wakeup calls that let us wake any process waiting for the lock we have just released:

/*
 * Lock a pipe.
 * If its already locked, set the WANT bit and sleep.
 */
plock(ip)
int *ip;
{
    register *rp;

    rp = ip;
    while(rp->i_flag&ILOCK) {
        rp->i_flag =| IWANT;
        sleep(rp, PPIPE);
    }
    rp->i_flag =| ILOCK;
}

/*
 * Unlock a pipe.
 * If WANT bit is on, wakeup.
 * This routine is also used to unlock inodes in general.
 */
prele(ip)
int *ip;
{
    register *rp;

    rp = ip;
    rp->i_flag =& ~ILOCK;
    if(rp->i_flag&IWANT) {
        rp->i_flag =& ~IWANT;
        wakeup(rp);
    }
}

At first I could not understand why readp() does not call prele(ip) before calling wakeup(ip+1). The first thing writep() does in its loop is call plock(ip), which would deadlock if readp() had not yet released its lock, so the code must somehow work correctly. A look at wakeup() makes it clear: it only marks the sleeping process as ready to run, and it is sched() that actually launches it later. So readp() calls wakeup(), unlocks the inode, sets IREAD, and calls sleep(ip+2), all before writep() ever resumes its loop.

This completes the tour of pipes in the sixth edition. Simple code, far-reaching consequences.

The Seventh Edition of Unix (January 1979) was the next major release, four years later, introducing many new applications and kernel facilities. It also changed significantly with the adoption of type casts, unions, and typed pointers to structures. However, the pipe code hardly changed, so we can skip this edition.

Xv6, a simple Unix-shaped kernel


The sixth edition of Unix influenced the Xv6 kernel, which is written in modern C and runs on x86 processors. The code is easy to read and understand. Moreover, unlike the Unix sources at TUHS, it can be compiled, modified, and run on something other than a PDP-11/70, which is why it is widely used in universities as teaching material for operating systems courses. The sources are on GitHub.

It contains a clear and well-thought-out implementation of pipe.c, backed by a buffer in memory instead of an on-disk inode. Here I reproduce only the definition of struct pipe and the pipealloc() function:

#define PIPESIZE 512

struct pipe {
  struct spinlock lock;
  char data[PIPESIZE];
  uint nread;     // number of bytes read
  uint nwrite;    // number of bytes written
  int readopen;   // read fd is still open
  int writeopen;  // write fd is still open
};

int
pipealloc(struct file **f0, struct file **f1)
{
  struct pipe *p;

  p = 0;
  *f0 = *f1 = 0;
  if((*f0 = filealloc()) == 0 || (*f1 = filealloc()) == 0)
    goto bad;
  if((p = (struct pipe*)kalloc()) == 0)
    goto bad;
  p->readopen = 1;
  p->writeopen = 1;
  p->nwrite = 0;
  p->nread = 0;
  initlock(&p->lock, "pipe");
  (*f0)->type = FD_PIPE;
  (*f0)->readable = 1;
  (*f0)->writable = 0;
  (*f0)->pipe = p;
  (*f1)->type = FD_PIPE;
  (*f1)->readable = 0;
  (*f1)->writable = 1;
  (*f1)->pipe = p;
  return 0;

 bad:
  if(p)
    kfree((char*)p);
  if(*f0)
    fileclose(*f0);
  if(*f1)
    fileclose(*f1);
  return -1;
}

pipealloc() sets up the state for the rest of the implementation, which consists of the piperead(), pipewrite() and pipeclose() functions. The actual system call sys_pipe is a wrapper implemented in sysfile.c. I recommend reading all of this code. Its complexity is on par with the sixth edition source, but it is much easier and more pleasant to read.

Linux 0.01


The source code for Linux 0.01 is still available, and it is instructive to study the pipe implementation in its fs/pipe.c. An inode is used to represent the pipe, but the pipe code itself is written in modern C. If you made it through the sixth edition code, you will have no trouble here. This is what write_pipe() looks like:

int write_pipe(struct m_inode * inode, char * buf, int count)
{
    char * b=buf;

    wake_up(&inode->i_wait);
    if (inode->i_count != 2) { /* no readers */
        current->signal |= (1<<(SIGPIPE-1));
        return -1;
    }
    while (count-->0) {
        while (PIPE_FULL(*inode)) {
            wake_up(&inode->i_wait);
            if (inode->i_count != 2) {
                current->signal |= (1<<(SIGPIPE-1));
                return b-buf;
            }
            sleep_on(&inode->i_wait);
        }
        ((char *)inode->i_size)[PIPE_HEAD(*inode)] =
            get_fs_byte(b++);
        INC_PIPE( PIPE_HEAD(*inode) );
        wake_up(&inode->i_wait);
    }
    wake_up(&inode->i_wait);
    return b-buf;
}

Even without looking at the structure definitions, you can see how the inode reference count is used to check whether a write should result in SIGPIPE. Apart from working byte by byte, this function maps easily onto the ideas described above. Even the sleep_on/wake_up logic does not look so alien.

Modern Linux, FreeBSD, NetBSD, OpenBSD kernels


I took a quick look at a few modern kernels. None of them uses a disk-backed implementation any more (not surprisingly). Linux has its own implementation, and although the three modern BSD kernels all contain implementations descended from code written by John Dyson, over the years they have diverged considerably from each other.

Reading fs/pipe.c (on Linux) or sys/kern/sys_pipe.c (on *BSD) takes real dedication. Today's code is concerned with performance and with support for features such as vectored and asynchronous I/O, and the details of memory allocation, locking, and kernel configuration all differ greatly. This is not what universities need for an introductory operating systems course.

In any case, it was interesting to unearth a few old patterns (for example, generating SIGPIPE and returning EPIPE on a write to a closed pipe) in all of these very different modern kernels. I will probably never see a PDP-11 in the flesh, but there is still much to learn from code written a few years before I was born.

Divye Kapoor's 2011 article "The Linux Kernel Implementation of Pipes and FIFOs" provides an overview of how pipes have worked on Linux (so far). And a recent Linux commit illustrates a model of pipe interaction whose capabilities exceed those of temporary files, and shows how far pipes have come from the "very conservative locking" of the sixth edition Unix kernel.
