Search for anagrams and sabanagrams in all words of the language

Solving problems with anagrams led to the idea:
How many words remain if you remove all anagrams and sangangrams from the dictionary of the Russian language

The dictionary found contains more than 1.5 million words in various forms.

You can compare each word with each, but for 1.5 million entries it is long and not optimal.
In a world with infinite memory, you can generate substrings of all permutations of each word and check our dictionary on them.

But is there a better solution ?

Let's start with the terminology:

Anagram - a word obtained by rearrangement of letters
Example: rocket and carriage

Sabanagram - a word that can be composed of letters of another word
Example: arch - sabanagram of the word rocket

Task :

Let's say our dictionary consists of five tricky words:

rocket , carriage , arch , cat , sputum

Add a dictionary to the prefix tree (Trie)
Each tree node contains a pair: letter + its number in the word
The nodes are sorted alphabetically and by the frequency of the letter in the word



Algorithm (in simplified form):

Take the floor, n.r. cat:

We are looking for nodes starting with the minimum letter of the word (“k”)

(In the figure, such nodes are marked in purple)

As soon as we find such a node, we look in the subtree for the path containing the remaining letters in the required quantity

In the word sputum below node K-1 there is O-2 and T-1, which is enough for our word cat.

The advantage of such a data structure is that we can quickly exit the subtree if the node letter> than the letter we are

looking at. Having checked our dictionary, we found out that only sputum is not an anagram or sabanagram of another word

Java code
 public boolean isAnagramOrSubAnagram(Word word) {
        Character minCharacter = word.getMinCharacter();

        Stack<TrieNode> stack = new Stack<>();
        stack.add(root);

        while (!stack.isEmpty()) {
            TrieNode node = stack.pop();

            for (Entry<TrieKey, TrieNode> entry : node.getChildren().entrySet()) {
                char character = entry.getKey().getCharacter();
                if (character < minCharacter) {
                    stack.add(entry.getValue());
                } else if (character > minCharacter) {
                    break;
                } else if (entry.getKey().getCount() >= word.getCharacterCount(minCharacter)) {
                    if (doesMinWordCharacterNodeContainAnagram(entry, word)) {
                        return true;
                    }
                }
            }
        }
        return false;
    }



→ The full version of the code with two dictionaries and tests

PS

For a Russian dictionary from 1.5 million, 242399 words left in 13 minutes
For an English dictionary from 416 thousand left 49251 in 45 seconds

You can optimize the original dictionary by removing the current word from it if the next one starts with it

All Articles