EOF is not a character

Recently, I was reading the book “Computer Systems: A Programmer's Perspective”. In the chapter on the Unix I/O system, the authors mention that there is no special EOF character at the end of a file. If you have read about the Unix/Linux I/O system or experimented with it, or if you have written C programs that read data from files, this statement will probably seem completely obvious to you. But let's take a closer look at the following two statements related to what I found in the book:





  1. EOF is not a character.
  2. There is no special character at the end of a file.

And what is EOF, anyway?

EOF is not a character


Why would anyone say or think that EOF is a character? I suppose it may be because some C programs explicitly check for EOF when calling the functions getchar() and getc().

It might look like this:

    #include <stdio.h>
    ...
    while ((c = getchar()) != EOF)
      putchar(c);

Or so:

    FILE *fp;
    int c;
    ...
    while ((c = getc(fp)) != EOF)
      putc(c, stdout);

If you look at the documentation for getchar() or getc(), you will find that both functions read the next character from the input stream. This is probably what causes the misconception about the nature of EOF. But these are just my assumptions. Note that both functions return an int rather than a char, which leaves room for a return value that is not a character at all. Let's return to the idea that EOF is not a character.

And what is a character, anyway? A character is the smallest component of text. “A”, “a”, “B”, “b” are all different characters. A character has a numeric code, which the Unicode standard calls a code point. For example, the Latin letter “A” has the decimal code 65. This can be quickly checked in the Python interpreter:

$ python
>>> ord('A')
65
>>> chr(65)
'A'

Or you can take a look at the ASCII table on Unix/Linux:

$ man ascii


Let's find out which code corresponds to EOF by writing a small C program. In ANSI C, the constant EOF is defined in stdio.h, which is part of the standard library. The standard only requires it to be a negative int; its value is usually -1. Save the following code in a file named printeof.c, then compile and run it:

#include <stdio.h>

int main(int argc, char *argv[])
{
  printf("EOF value on my system: %d\n", EOF);
  return 0;
}

Compile and run the program:

$ gcc -o printeof printeof.c

$ ./printeof
EOF value on my system: -1

On Mac OS and on Ubuntu, this program reports that EOF equals -1. Is there a character with this code? Again, you could check the character codes in the ASCII table, or look at the Unicode table to find the range of valid character codes. We'll take a different approach: start the Python interpreter and ask the standard function chr() for the character corresponding to the code -1:

$ python
>>> chr(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)

As expected, there is no character with the code -1. So EOF really is not a character. Let's now turn to the second statement.
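Before moving on, here is a small Python sketch of why a value such as -1 can serve as an end-of-file signal without ever being mistaken for data. It writes every possible byte value to a temporary file and reads them back as integers. The constant EOF below is our own name, chosen only to mirror the usual value of the C constant; Python itself defines no such constant.

```python
import os
import tempfile

EOF = -1  # our own illustrative constant, mirroring the usual value of C's EOF

# Write every possible byte value, 0x00 through 0xFF, to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)))
    path = tmp.name

# Read the bytes back; converting a bytes object to a list yields ints.
with open(path, "rb") as f:
    values = list(f.read())

os.remove(path)

print(min(values), max(values))  # 0 255
print(EOF in values)             # False
```

Whatever a file contains, each byte read from it falls in the range 0 to 255, so a negative return value is free to mean "nothing was read".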

There is no special character at the end of a file


Maybe EOF is a special character that can be found at the end of a file? I suppose you already know the answer. But let's check our assumption carefully.

Take a simple text file, helloworld.txt, and display its contents in hexadecimal. You can use the xxd command for this:

$ cat helloworld.txt
Hello world!

$ xxd helloworld.txt
00000000: 4865 6c6c 6f20 776f 726c 6421 0a         Hello world!.

As you can see, the last character of the file has the code 0a. The ASCII table shows that this code corresponds to the character nl, that is, the newline character. You can also check this in Python:

$ python
>>> chr(0x0a)
'\n'
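The same check can be sketched programmatically. The snippet below recreates the 13-byte content of helloworld.txt in a temporary file (the temporary file is our own stand-in, not the author's file) and compares the size recorded by the file system with the number of bytes that can actually be read.

```python
import os
import tempfile

text = b"Hello world!\n"  # 13 bytes, the same content as helloworld.txt

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(text)
    path = tmp.name

size = os.path.getsize(path)  # size recorded by the file system
with open(path, "rb") as f:
    data = f.read()           # every byte the file contains

os.remove(path)

print(size, len(data))  # 13 13
print(hex(data[-1]))    # 0xa
```

The file holds exactly the bytes that were written and nothing more. The system knows where the file ends from its recorded size, not from a marker byte stored in the data.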

So, EOF is not a character, and there is no special character at the end of a file. What is EOF, then?

What is EOF?


EOF (end-of-file) is a state that an application can detect when a read operation reaches the end of a file.
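A minimal sketch of what "state" means here, assuming a Unix-like system and using Python's os.read(), a thin wrapper over the read() system call. A read at the end of the file returns nothing, but once another writer appends data, the very same descriptor can read again. The file name and contents are made up for the illustration.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    path = tmp.name

fd = os.open(path, os.O_RDONLY)
first = os.read(fd, 10)   # b"abc": fewer bytes than requested, not yet EOF
at_end = os.read(fd, 10)  # b"": the read position is at the end of the file

# Append more data through a separate handle.
with open(path, "ab") as f:
    f.write(b"def")

after_append = os.read(fd, 10)  # b"def": the descriptor has data again

os.close(fd)
os.remove(path)

print(first, at_end, after_append)  # b'abc' b'' b'def'
```

If EOF were a character stored in the file, the second read would have consumed it. Instead, "end of file" only describes where the read position was at the moment of the call.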

Let's look at how you can detect the EOF state in different programming languages when reading a text file with the high-level I/O facilities these languages provide. To do this, we will write a very simple version of cat called mcat. It reads ASCII text byte by byte (character by character) and explicitly checks for EOF. We will write the program in the following languages:

  • ANSI C
  • Python 3
  • Go
  • JavaScript (Node.js)

Here is a repository with the sample code. Let's go through the examples.

ANSI C


Let's start with the venerable C. The program presented here is a modified version of cat from the book "The C Programming Language".

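/* mcat.c */
#include <stdio.h>

int main(int argc, char *argv[])
{
  FILE *fp;
  int c; /* int, not char: must hold every byte value plus EOF */

  if (argc < 2) {
    fprintf(stderr, "usage: mcat file\n");
    return 1;
  }

  if ((fp = fopen(argv[1], "r")) == NULL) {
    fprintf(stderr, "mcat: can't open %s\n", argv[1]);
    return 1;
  }

  /* copy one byte at a time until getc() reports EOF */
  while ((c = getc(fp)) != EOF)
    putc(c, stdout);

  fclose(fp);

  return 0;
}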

Compilation:

$ gcc -o mcat mcat.c

Launch:

$ ./mcat helloworld.txt
Hello world!

Here are some explanations regarding the above code:

  • The program opens the file passed to it as a command line argument.
  • The while loop copies data from the file to standard output. The data is copied byte by byte until the end of the file is reached.
  • When the program reaches EOF, it closes the file and exits.

Python 3


Python has no mechanism for explicitly checking for EOF like the one in ANSI C. But if you read a file character by character, you can detect the EOF state when the variable holding the most recently read character is empty:

# mcat.py
import sys

with open(sys.argv[1]) as fin:
    while True:
        c = fin.read(1)  # read at most 1 character
        if c == '':      # an empty string means EOF
            break
        print(c, end='')

Run the program and look at its output:

$ python mcat.py helloworld.txt
Hello world!

Here is a shorter version of the same example for Python 3.8+. It uses the := operator (known as the “walrus operator”):

# mcat38.py
import sys

with open(sys.argv[1]) as fin:
    while (c := fin.read(1)) != '':  # read at most 1 character until EOF
        print(c, end='')

Run this code:

$ python3.8 mcat38.py helloworld.txt
Hello world!

Go


In Go, you can explicitly check the error returned by Read() to see whether it indicates that we have reached the end of the file:

// mcat.go
package main

import (
    "fmt"
    "os"
    "io"
)

func main() {
    file, err := os.Open(os.Args[1])
    if err != nil {
        fmt.Fprintf(os.Stderr, "mcat: %v\n", err)
        os.Exit(1)
    }

    buffer := make([]byte, 1) // 1-byte buffer
    for {
        bytesread, err := file.Read(buffer)
        if err == io.EOF {
            break
        }
        fmt.Print(string(buffer[:bytesread]))
    }
    file.Close()
}

Run the program:

$ go run mcat.go helloworld.txt
Hello world!

JavaScript (Node.js)


Node.js has no mechanism for explicitly checking for EOF. But when the end of the file has been reached and an attempt is made to read more, the stream emits the end event.

/* mcat.js */
const fs = require('fs');
const process = require('process');

const fileName = process.argv[2];

var readable = fs.createReadStream(fileName, {
  encoding: 'utf8',
  fd: null,
});

readable.on('readable', function() {
  var chunk;
  while ((chunk = readable.read(1)) !== null) {
    process.stdout.write(chunk); /* chunk is one byte */
  }
});

readable.on('end', () => {
  console.log('\nEOF: There will be no more data.');
});

Run the program:

$ node mcat.js helloworld.txt
Hello world!

EOF: There will be no more data.

Low level system mechanisms


How do the high-level I/O mechanisms used in the examples above determine the end of a file? On Linux, these mechanisms directly or indirectly use the read() system call provided by the kernel. The C function (or macro) getc(), for example, uses the read() system call and returns EOF when read() indicates that the end of the file has been reached. In that case, read() returns 0. Depicted as a diagram, this looks as follows:

[Diagram: getc() calls read()]

In other words, getc() is built on top of read().
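To make this layering concrete, here is a rough Python sketch of a getc-like function built directly on os.read(). The names my_getc and EOF are hypothetical and exist only for this illustration. The real getc() also reads in larger buffered chunks instead of issuing one system call per byte, so this shows the idea, not the actual libc implementation.

```python
import os
import tempfile

EOF = -1  # hypothetical sentinel, chosen to mirror the usual value of C's EOF


def my_getc(fd):
    """Return the next byte as an int in 0..255, or EOF at end of file."""
    chunk = os.read(fd, 1)  # thin wrapper over the read() system call
    if not chunk:           # read() delivered 0 bytes: end of file
        return EOF
    return chunk[0]


# Prepare a small temporary file to read from.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Hi\n")
    path = tmp.name

fd = os.open(path, os.O_RDONLY)
codes = []
while True:
    c = my_getc(fd)
    if c == EOF:
        break
    codes.append(c)
last = my_getc(fd)  # asking again at the end still yields EOF
os.close(fd)
os.remove(path)

print(codes, last)  # [72, 105, 10] -1
```

The sentinel is produced by the wrapper when read() comes back empty. It is never read from the file itself.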

Let's write a version of cat named syscat that uses only Unix system calls. We will do this not only out of curiosity, but also because it may well prove useful.

Here is this program written in C:

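/* syscat.c */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
  int fd;
  char c;

  if (argc < 2)
    return 1;

  fd = open(argv[1], O_RDONLY, 0);
  if (fd == -1)
    return 1;

  /* read() returns 0 at end of file and -1 on error; stop on either */
  while (read(fd, &c, 1) > 0)
    write(STDOUT_FILENO, &c, 1);

  close(fd);

  return 0;
}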

Run it:

$ gcc -o syscat syscat.c

$ ./syscat helloworld.txt
Hello world!

This code relies on the fact that read() returns 0 to indicate that the end of the file has been reached.

Here is the same program written in Python 3:

# syscat.py
import sys
import os

fd = os.open(sys.argv[1], os.O_RDONLY)

while True:
    c = os.read(fd, 1)
    if not c:  # EOF
        break
    os.write(sys.stdout.fileno(), c)

Run it:

$ python syscat.py helloworld.txt
Hello world!

Here is the same thing written in Python 3.8+:

# syscat38.py
import sys
import os

fd = os.open(sys.argv[1], os.O_RDONLY)

while c := os.read(fd, 1):
    os.write(sys.stdout.fileno(), c)

Run this code too:

$ python3.8 syscat38.py helloworld.txt
Hello world!

Summary


  • EOF is not a character.
  • There is no special character at the end of a file.
  • EOF is a state reported by the kernel that an application can detect when a read operation reaches the end of a file.
  • In ANSI C, EOF is, again, not a character. It is a constant defined in stdio.h, and its value is usually -1.
  • There is no EOF “character” in the ASCII table or in Unicode.

Dear readers! Do you know about any more or less widespread misconceptions from the world of computers?

