Quick file search

From the translator: I offer a translation of a very old article, published on January 15, 1983. Despite its impressive age, the article seemed interesting to me, and it may well be useful to someone even today. Incidentally, this article is referenced by the locate(1) man page on opennet.ru: https://www.opennet.ru/man.shtml?topic=locate.



Summary


This article describes a fast file search mechanism for UNIX. It combines two data compression methods with a new string search technique and is designed to find arbitrary files quickly. The code, integrated into the standard find utility, searches a previously built database that is updated daily. This distinguishes it from the usual approach of matching the key against candidates generated on the fly from the directory structure scattered across the disk.

The file path database is an incrementally encoded, lexicographically sorted list (sometimes called a “front compressed” file), which additionally undergoes bigram encoding to achieve effective compression. The compression ratio is 5 to 6 relative to the usual ASCII representation. The list is scanned with a modified linear search, specially adapted to the incremental encoding, and the typical running time of the algorithm is 40-50% lower than that of an ordinary search.

Introduction


Finding files in a computer system, or in a computer network, is a complex process. UNIX users can carry it out in a variety of ways, from juggling the cd, ls, and grep commands, through specialized commands such as whereis and fleece developed at Berkeley, to the more general UNIX find command.

Unfortunately, fleece (from Berkeley) is limited to the home directory, and whereis can only search for system code and documentation located in standard places.

Search:

find / -name "*<filename>*" -print

can, of course, find arbitrary files, but it is very slow, because it recursively descends through the entire file system, mercilessly scattered across the disk. Impatience prompted us to develop an alternative to this “search and find” way of obtaining file paths.

Preliminary calculations


Why not simply build a static list of all the files in the system and run grep over it? A typical system with 20,000 files will need about 1,000 blocks for the file names, even if /usr is shortened to /u. On our unloaded PDP-11/70, grep processes 30-40 blocks per second, so a scan would take about half a minute (1,000 blocks at 30-40 blocks per second is roughly 25-33 seconds). That is not acceptable for a frequently used command.

But if we make a small sacrifice, giving up the ability to find files less than a day old, this causes no great problems: whoever created such a file is probably within reach, or the file may not yet be ready for use. Older files, created by other groups of users with different file-naming conventions, are the most likely search targets.

Compression


To speed up access, one might suggest a binary search or a hash table, but such schemes do not work well for partial matches, when we are interested in only part of a name. An easily implemented technique for compressing ordered data, known as incremental encoding, was used for a similar dictionary-compression task [Morris/Thompson, 1974]. Here the length of the longest prefix shared with the previous name is computed. For instance:

/usr/src
/usr/src/cmd/aardvark.c
/usr/src/cmd/armadillo.c
/usr/tmp/zoo


converted to

0/usr/src
8/cmd/aardvark.c
14armadillo.c
5tmp/zoo
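
As an illustration, here is a minimal sketch of how such a front-compressed list might be produced from an already sorted list of paths. It is not the article's production code: the names are invented, and the count-byte convention (counts capped at 127 and written with the high bit set, whereas the example above shows plain decimal counts for readability) is simply inferred from the decoder shown next.

/* Hypothetical encoder sketch: front-compress a sorted list of paths
   read from standard input, one per line.  Each output record is one
   count byte (shared-prefix length, high bit set) followed by the
   suffix that differs from the previous path.  The high-bit convention
   is inferred from the decoder below and may not match the production
   format. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char prev[1024] = "", line[1024];
    int count;

    while (fgets(line, sizeof line, stdin) != NULL) {
        line[strcspn(line, "\n")] = '\0';        /* strip the newline */
        /* length of the prefix shared with the previous path, at most 127 */
        for (count = 0; count < 0177 && line[count] != '\0'
                     && line[count] == prev[count]; count++)
            ;
        putchar(count | 0200);                   /* count byte, high bit set */
        fputs(line + count, stdout);             /* the differing suffix */
        strcpy(prev, line);
    }
    return 0;
}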

Decoding is simple (variable declarations omitted):

fp = fopen(COMPRESSED_FILELIST, "r");
while((c = getc(fp)) != EOF) {          /* c is the next count byte */
    count = c & 0177;                   /* length of prefix shared with the previous path */
    for(p = path + count; (c = getc(fp)) != EOF && c < 0200; )
        *p++ = c;                       /* overlay old path with the new suffix */
    if(c != EOF)
        ungetc(c, fp);                  /* push back the count byte of the next entry */
    *p-- = '\0';                        /* p is left pointing at the last character */
    if(match(path, name) == YES)
        puts(path);
}

where match is a function that determines whether the string name occurs as a substring of the string path.

In fact, since the encoded list is about five times shorter than the unencoded one, and decoding is very simple, the program runs three to four times faster than grep over the plain text file.

Even faster


This approach is already useful, but it leaves room for further improvement. (An aside: this code lives in the find utility; there is no need to burden UNIX with yet another command, and another man page, when we can improve an existing, similar program.) Fortunately, there is no existing two-argument form of find, so we fill the void without embellishment:

find name

Note that the code above still searches the unpacked list, albeit in memory rather than on disk. This can be avoided by comparing the string against the substring in the reverse direction. Let namend point to the last character of the string name, and replace the call to match with:

for(s = p, cutoff = path + count; s >= cutoff; s--) {
    if(*s == *namend) { /* quick pre-check on the pattern's last character */
        /* compare backwards; assumes a '\0' precedes name[0],
           so the scan stops at the front of the pattern */
        for(p = namend - 1, q = s - 1; *p != '\0'; p--, q--)
            if(*q != *p)
                break;
        if(*p == '\0') {
            puts(path);
            break;
        }
    }
}

This is easier to understand by considering three cases. If the matched substring lies entirely to the right of cutoff, the comparison proceeds as usual and can succeed. If it straddles cutoff, the boundary becomes “soft” and the comparison simply continues to its left. If the substring lies entirely to the left of cutoff, a match would already have been found on a previous line, so we need not perform the comparison at all! For example, if the pattern occurs entirely within the first count characters of the current path, those same characters began the previous path, and the match was already reported there. Technically, cutoff can be moved back to path as soon as one comparison has been carried out; this refinement is omitted above.
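
The backward scan above stops at a '\0', which assumes a NUL byte sits just before the first character of name. One way to arrange that, sketched under this assumption (the helper name is invented, not from the article), is to keep the pattern in a buffer with a sentinel prepended and let namend point at its last character:

#include <string.h>

static char patbuf[1026];       /* '\0' sentinel followed by the pattern */

/* Hypothetical helper: copy name into a buffer preceded by a '\0'
   sentinel, so that the backward comparison above stops at the front
   of the pattern.  Returns a pointer to the pattern's last character,
   suitable for use as namend. */
char *prep_pattern(const char *name)
{
    size_t len = strlen(name);

    if (len > sizeof patbuf - 1)        /* a sketch; just truncate overlong patterns */
        len = sizeof patbuf - 1;
    patbuf[0] = '\0';                   /* sentinel for the backward scan */
    memcpy(patbuf + 1, name, len);      /* the pattern itself */
    return patbuf + len;                /* its last character */
}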

Two-tier technique


The slowdown caused by processing search-pattern metacharacters can also be avoided. To do so, the fast scan handles the part of the name that contains no metacharacters, and the slower recursive function amatch inside find is used to verify candidates against the full pattern. That is,

puts(path)

becomes

if(globchars == NO || amatch(path, name))
    puts(path);

where globchars is set if name contains metacharacters. Here is an example of using a search pattern in a simple man command:

vtroff -man `find '*man*'"$1"'.[1-9]'`
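
Returning to globchars: one way it might be determined is a simple scan of the pattern for metacharacters. The helper below is only an assumption for illustration; the function name and the exact metacharacter set are not from the article.

#include <string.h>

#define NO  0
#define YES 1

/* Hypothetical helper: report whether the pattern contains shell-style
   metacharacters, so that the slower recursive amatch() is consulted
   only when it is actually needed. */
int has_globchars(const char *name)
{
    return strpbrk(name, "*?[") != NULL ? YES : NO;
}

globchars would then be set once, before the list is scanned: globchars = has_globchars(name);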

Further improvements


The production code of the find utility increases the compression ratio by another 20-25% by replacing the most common two-character combinations with non-printable ASCII codes. ".c" and ".P" are especially common.
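
Which two-character combinations deserve a code can be found with a simple frequency count over the path list. The sketch below illustrates only that counting step; the reporting threshold, the table layout, and the way the chosen pairs are embedded in the compressed file are assumptions rather than details from the article.

/* Hypothetical sketch: count two-character combinations in the path
   list (read from standard input, one path per line) to find
   candidates for the bigram table.  Only the counting step is shown;
   the production encoding of the chosen pairs is not reproduced. */
#include <stdio.h>

int main(void)
{
    static long freq[128][128];          /* counts for 7-bit ASCII pairs */
    int prev = EOF, c, i, j;

    while ((c = getchar()) != EOF) {
        if (c == '\n') {                 /* pairs do not span path boundaries */
            prev = EOF;
            continue;
        }
        if (prev != EOF && prev < 128 && c < 128)
            freq[prev][c]++;
        prev = c;
    }
    /* report pairs frequent enough to be worth a one-byte code */
    for (i = 1; i < 128; i++)
        for (j = 1; j < 128; j++)
            if (freq[i][j] > 1000)
                printf("%c%c\t%ld\n", i, j, freq[i][j]);
    return 0;
}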

Other algorithms offering different trade-offs between time and data size, such as Huffman coding [Reghbati, 1981], do not look promising: they merely replace the I/O bottleneck with a CPU bottleneck.

Boyer-Moore sub-linear search methods [Boyer, 1977] or macro-model methods [Storer / Szymanski, 1982] can be used.

In conclusion, it should be noted that we scanned 19,000 file names in a few seconds, using 180 blocks and C code that fits on two pages.

References


Boyer, R.S. and Moore, J.S. A Fast String Searching Algorithm, Commun. ACM, Vol. 20, No. 10, October 1977.
Morris, R. and Thompson, K. Webster's Second on the Head of a Pin, Unpublished Technical Memo, Bell Laboratories, Murray Hill, NJ, 1974.
Reghbati, H.K. An Overview of Data Compression Techniques, Computer, Vol. 14, No. 4, April 1981.
Storer, J.A. and Szymanski, T.G. Data Compression via Textual Substitution, J. ACM, Vol. 29, No. 4, October 1982.
