Hans Peter Lun and the birth of the hash algorithm

The hash algorithm of an IBM engineer gave computers the ability to quickly search for documents, DNA, and databases.


Starting in the 1940s, Lun developed machines and systems for analyzing information, in particular, the currently widely used hash algorithm, which he proposed as a sorting method. numbers and text.

In November 1958 , during a six-day international conference on scientific information, inventor Hans Peter Lun demonstrated several electromechanical machines. They looked very ordinary. Like all other computing devices of the time, they were angular, practical and designed to receive and sort piles of punch cards in slots and baskets.

However, unlike other computers, the Moon devices did not work with numbers and formulas, but with words and sentences. One of the machines that attracted particular attention used the Moon algorithm called KWIC, Key for Word in Context . Having received a large amount of text - for example, an article from 500 to 5000 words, KWIC could quickly and autonomously build something like an index.

At that time, indexation, classification and organization of information in writing was an extremely time-consuming process, even for experienced professionals. And the volume of information in some areas was growing too fast to support it. Desperately needed new, better tools for abstracting and generalizing. For librarians and information scientists gathered in Washington, the KWIC demonstration was akin to an earthquake. Newspapers across the country immediately wrote about the outstanding invention of the Moon.

By the early 1960s, KWIC had become the basis for the development of hundreds of computerized indexing systems, including those used by the Chemical Abstracts Service, Biological Abstracts and the Institute of Scientific Information. Some experts called the development of KWIC "the greatest event in the world of chemistry since the invention of the test tube." Lun, IBM's senior engineer, also built a “smart system” for business with KWIC. (Note: it was he who first proposed the term BI) She could identify and deliver relevant information to specific individuals of large companies. In essence, KWIC was the equivalent of a search engine at the time: it allowed users to quickly find the information they needed.

Now we take for granted that computers can work with information and instantly offer us restaurant reviews, sports results, and stock prices. In the days of the moon, computers were simple and primitive. His attempts to work with text helped form a broader view of computers and their capabilities. And his ideas are still the basis of the algorithms that we use for online shopping, machine translation and genetic research. Of course, in the 1950s all this was completely unthinkable. Next, we’ll talk about what the moon led to a hash function, a solution to a problem that did not even exist.

The years after World War IIhad a huge impact on electronic computing devices. The various machines of the war years performed vital calculations for ballistics, atomic weapons, and cryptography. The ensuing Cold War provided computer developers with continuous funding, and as a result, machines became faster, more accurate and more powerful. But their main purpose - processing and storing numbers - remained unchanged.

In the nascent computer world, Lun was an unusual character. Always having a good taste for clothing, Lun understood much more in the textile industry than in computer science. He came to work at IBM in 1941. Numerous inventions of the Moon, it seemed, still belonged to the pre-digital era of mechanical calculators and slide rules. Even the digital computers of the 1950s were more powerful than its electromechanical devices. Nevertheless, the ideas of the Moon, one way or another rethought and transformed, are now applied in almost all types of software.

Hans Peter Lun was born in 1896 in the German city of Barman. His father, Johann Lun, a successful printer, was very loyal to the hobbies of his children. For example, once Hans, along with his brothers and sisters, built a miniature railway in the family garden. A 70-meter lead rails were smelted on his father’s machine.

After graduating from school, Lun went to study family craft in Switzerland. But the First World War and the draft in the German army interrupted his printed career. After the war, Lun began to trade in textiles. In the United States, he found himself in 1924 in search of potential places for the construction of textile factories. And even in the textile business, the inventive vein of the moon has found application. In 1927, he developed a special line with which it was possible to count the number of threads in the fabric.The Lunometer is still sold by HP Luhn & Associates, an engineering and consulting company founded by Moon.

Lun learned fast. He literally absorbed knowledge from various fields and gradually became an experienced climber, gourmet culinary specialist and a good landscape painter. By the 1930s, an extensive list of his patents included: a folding cloak , a device for shaping women's stockings, a game table, and the Cocktail Oracle - a guide that helped the user make a cocktail from what was at hand.


In 1933, shortly before the end of the Prohibition, Lun filed a patent for a guide that helped make a cocktail of the ingredients available.

But most of all, the Moon was interested in the storage, transmission and retrieval of information, especially textual information. Actually, these interests led him to IBM, where he received the “title” of the inventor. Lun turned out to be unusually prolific - during his career, he created about 70 patents for IBM. Despite the fact that no one limited him in his choice of tasks, many of his inventions were centered around the use of machines, including electronic ones, for manipulating information.

For example, in 1946 and 1947, Lun worked to teach machines to “read” documents typed on a typewriter. One of his devices consisted of a metal tape tucked into a typewriter that applied magnetic codes to paper. Then another machine could scan them. A little later, he, along with two MIT chemists, Malcolm Dyson and James Perry, began work on a machine that could automatically search for chemical compounds using punch cards. Each punch card was encoded with information about a particular connection. The user had to insert a “request card” into the machine and indicate a set of criteria by which it is necessary to compare and sort composite cards. The scanner turned out to be extremely narrowly specialized, and Lun continued to search for more universal ways of automatic processing of information.

The problem of information in those years was on everyone's lips. In the postwar period, the number of scientific and technical publications increased sharply. Many experts feared that business and science would fall due to information overload. Vannevar Bush , during the war, the leader of a major American scientific department and one of the initiators of the creation of the National Science Foundation, proposed a table- sized Memex electromechanical device to store and link information.

Bush’s proposal was never realized, unlike Moon’s ideas. For example, on January 6, 1954, he filed a patent for a Computer for Checking Numbers.. It was a manual mechanical device designed to solve a simple practical problem. At that time, various types of identification numbers, such as credit card numbers and social security numbers, began to play a large role in public and private life. However, the numbers were difficult to remember, they could be incorrectly decrypted or intentionally falsified. A way was needed to quickly verify the correctness of identification numbers.

A pocket-sized machine, the moon, was just about that. She worked on the basis of the checksum algorithm that he developed. For a 10-digit number, the computer performed the following actions:
  • double every second digit;
  • if any result is greater than or equal to 10, add the digits of this result to get a single-digit number (for example, “16” will become 1 + 6 = 7);
  • add all 10 digits of the new number;
  • multiply by 9;
  • take the last digit of this result.

This algorithm produced a unique verification number. In the original formulation of the moon, “0” meant that the original number was real. In later versions, the verification number was simply added to the original number in the form of the last digit, and it was easy to check whether the last digit corresponds to the result of the check on the Moon machine. The basic sequence of calculations, now known as the Moon Algorithm, is still widely used. This verifies the numbers of international mobile equipment identifiers (IMEIs) assigned to cell phones.

But it’s much more important that one of the most important algorithms of the digital age came from the gears and wheels of the Moon machine: hashing. This wide class of algorithms offers powerful means of organizing information so that a computer can easily find it. This is similar to a corned beef and potato recipe: a hash algorithm, like a cook, splits and mixes data in various ways. Such “confusion” with proper deployment can speed up many types of computer operations.

In early 1953, Lun sent an internal note to IBM, where he suggested putting information in “buckets” (bucket - bucket, basket) to speed up the search. Suppose you need to find the phone number in the database and find out who it belongs to. It consists of 10 numbers: "314-159-2652." The computer will be able to check one number from the database at a time until it finds the desired entry. However, if the database contains millions of numbers, it will take a lot of time.

The idea of ​​the Moon was to arrange all the entries in numbered baskets. This was done as follows: the digits of the phone number are grouped in pairs (in this case, 31, 41, 59, 26, 52). Then the pairs of numbers are added (4, 5, 14, 8, 7), and a new number is generated from them. If the result of the addition inside the pair contains two digits, only the last one is taken (that is, it turns out 45487). The original phone number and the name / address corresponding to it will be placed in the basket with the number 45487. The

search by phone number consisted in calculating the basket number (using the Moon method) and then extracting information from this basket. Even if the group contained several records, a search among them was much faster than a search in the entire database.

For decades, computer scientists and programmers have been perfecting the Moon's methods and finding new uses for them. But the basic idea remained the same: use math to organize the data into easy-to-find groups. Since the problems of organizing and searching for data are very common in computer technology, hashing algorithms have become indispensable in cryptography, graphics, telecommunications and biology. Every time you send a credit card number over the Internet or use a text editor dictionary, hash functions work.


Quick Indexing: At the 1958 International Conference on Scientific Information, Hans Peter Lun (right) demonstrates the IBM system for automatically creating document indices based on the KWIC algorithm he developed.

Moon Ideas in Computer Sciencewent far beyond the usual search. He realized that computers are capable of complex manipulations with text: reading and understanding written language. Subsequent indexing and organization of information to solve practical problems in science and business. By 1958, his chemical card sorter had become the Universal Card Scanner and the 9900 Special Index Analyzer, which were demonstrated at the Washington conference. These electromechanical devices could search and sort punch cards according to user criteria.

But most of the noise was made by KWIC, the Moon system for building concordances. Concordance is an alphabetical list of keywords used in a book or collection of works. It looks like a glossary, but it lists only words that actually appear in the text, and not concepts. Words that do not carry a semantic load, such as prepositions, conjunctions and articles, do not fall into concordance. Concordances have long been used in theology and philology. For example, the concordance of the Bible will indicate every use of the word “love” with reference to a book, chapter, and verse. Before full-text computerized search appeared, creating concordance was a time-consuming process. More often than not, concordance was created for “important” works, such as the Bible or Shakespeare’s collected works.

What the Moon system once did to search by numbers, KWIC did for texts. Both the one and the other made it possible to easily search through large volumes of information. Take a very simple example. Suppose you want to create a concordance from the words in the titles of the following four books: Gone With the Wind, War and Peace, Shadow of the Wind, and Shadow of War. (Note: in the original - Gone With the Wind, War and Peace, The Shadow of the Wind, and Shadows of War)



The KWIC algorithm will rearrange the words from the names in all possible orders, and then arrange each permutation alphabetically. The result will be a complete list of keywords (that is, everything except prepositions, conjunctions and articles) in the context in which they appeared.

The KWIC Moon system quickly found application in the scientific community. But he knew that it would be useful for business as well. In 1958, he wrote an article for the IBM Journal of Research and Development entitled “A Business Intelligence System”. In it, he proposed a system that could automatically generate abstracts of articles, extract key ideas from the abstracts, and then send the result to the appropriate employees of the company. Lun understood that solving the problem of information overload means developing a method for quickly sorting information that will not burden people with excess materials.

The New York Times in an obituary of 1964, Luna described his abstract system as follows:

Scientific American 2326 , IBM . , ...


Luna's abstraction program first calculated the frequency of all words in the article. After discarding the very common words, "abstract" found sentences in which several of the most frequent words occur together. Such proposals are considered representative and are placed in an abstract. It was a purely statistical method that made no attempt to "understand" the words in the article or the relationship between them. But, like KWIC, he showed that computers can be effectively used to organize text into an easy-to-read format.

Lun left IBM in 1961and three years later he died of leukemia. He did not live to see the day when major changes took place on the Internet. Excluding a small circle of information specialists, textile workers and historians, few people remember his name. But the ideas of the moon live. Today, hashing plays many roles in managing and protecting our digital lives. When you enter your password on a website, the server most likely saves a hashed version of the password. When you interact with the site using the secure https protocol or buy something using bitcoins, hashes also work there. With cloud services such as Dropbox and Google Drive, hashing helps make storage and file sharing much more efficient. In genetics and other high-tech research, hashing significantly reduces the time required to analyze huge amounts of data.

Hashes turned computers into “textual” tools that can be spoken with letters and words. Google Translate, Google N-gram, Google AdWords, and Google Search are all in one way or another dedicated to finding the meaning of the text. The information boom on the Internet has made automatic reading and understanding [of news and other information] a priority for business, for science, for everyone. The development of hashes has been associated with texts and is a reflection of Lun’s thoughts about words, sentences, concordances, excerpts, indices and digests.

This is Luna’s legacy: he helped show that calculators and computations are not only the domain of mathematics, statistics and logic, but also language, linguistics and literature. At that time, it was a revolutionary way of perceiving machines.

Technological historian Michael Mahoneycalled the computer “a steroid machine”: “not just one thing, but many things, a machine that can be sharpened for any purpose. Even now, we tend to view computers in the narrow sense as giant digital processors that perform millions of calculations and operations per second. Hans Peter Moon's gaze on computers extended further. Realizing that the computer is multifaceted, it has helped to open new, promising horizons for research. ”

All Articles