Creating a navigator using augmented reality technologies and machine learning methods

People can travel more easily than ever, and we keep discovering new places. Finding your way around them is not always easy, though, and some buildings have such a complex layout that a first-time visitor can get lost.

Our Engineering and Technology School No. 777 in St. Petersburg is one such building. When guests and parents visit the school, they struggle to find the right room. Everyone copes in their own way: some keep asking the staff for directions, others just wander the endless corridors. After analyzing the problem, we decided to build a navigator for our school's visitors. A plain navigator with an indoor map would be nothing new, unlike one built on augmented reality (AR), a new and rapidly developing technology. Our navigator uses AR, computer vision and machine learning methods to guide visitors to the exact place they need.

Purpose

The goal is to create a universal assistant for finding your way around buildings.

Main objectives:

Make it easier for new students, parents and guests to find their way around the school;
Make the application universal, so that it can later support navigation in any other building in the world, not only in our school;
Provide a user-friendly interface that everyone can understand.

AR - Augmented Reality

New technologies appear every year, and one of them is augmented reality, or simply AR. Many people first heard of it when Google decided to make augmented reality glasses. AR then began to develop incredibly fast. After Google abandoned its Google Glass project, a new era began: masks that recognized our faces and turned us into celebrities. Then Pokémon took over both realities, and people walked kilometers in search of the coolest character.

More recently, Google and Apple introduced their AR engines, ARCore and ARKit respectively, which suggests that AR will become even more accessible to developers and that more and more AR applications and games will appear.

What is AR?

Augmented reality is an environment that supplements the real world as we see it, in real time, with digital data processed by computing devices and software.

Augmented reality (AR) must also be distinguished from virtual reality (VR) and mixed reality (MR).

What are their main differences?

In AR, unlike in virtual reality, virtual digital objects are projected onto the real environment.

Virtual reality is a world created by digital means and conveyed to a person through the senses.

Mixed reality is a cross between VR and AR and combines both approaches.

VR creates its own world in which a person is immersed, while augmented reality works with the real physical world and introduces virtual objects into it. It follows that VR interacts only with the user, while AR interacts with the outside world.

Development prospects

AR may occupy the niche that science fiction assigned to holograms. Holograms will not arrive soon, but devices like HoloLens (Microsoft's augmented reality glasses) are technically ready. The prospect of virtual interactive illustrations in schools, which you can view from all sides, interact with, and use to see the results of your experiments immediately, no longer seems like a bright fantasy about the future. Training in any engineering specialty can become much more visual, easier to understand and more interesting.

To summarize, augmented reality is not only games and cool masks for social networks. It offers a huge number of opportunities in education, industry and medicine.

The growth of augmented reality is astounding. Unlike VR, it does not rely on massive headsets and powerful hardware; it needs only the most compact and mobile device of our time, a smartphone.

Augmented reality is already changing our present: virtual masks, Pokémon hunting in cities and swamps, children shooting at each other not with sticks but through a phone screen. All of this is already real.

Work planning

Our navigator has a room number recognition system. It is based on machine learning, specifically on a convolutional neural network. Let's dive into the architecture.

For this model, we had to solve two main problems:

detecting letters in the image;
recognizing the letters.

While good old LeNet-like networks exist for recognizing letters, for detecting letters in an image we had to design our own neural network architecture, compile a dataset and train the model. To design the architecture, we first needed to understand what kind of machine learning problem the model would solve.

After much debate, we reached a consensus: the text detector should solve a regression problem, i.e. the model should predict where a letter is likely to be located. We first built a neural network with two hidden layers (not counting the input and output layers), but the large number of outputs led to overfitting. We concluded that we needed a different architecture and approach: detect words first, and only then letters.

We spent quite a lot of time changing the dataset and then added another neural network to the cascade. This reduced our error (we chose Mean Squared Error as the metric) by 40%, but the actual result was still far from what we wanted. We therefore decided to rewrite the architecture of both networks. To do that, we had to take the “ensemble” apart, understand how each network works on its own, and understand why each parameter of each network is needed.

Then we realized that instead of reducing the number of parameters we could change the approach to the problem, so our team changed the model: it is now one convolutional neural network. With this approach the error dropped by 30%, and a forward pass through the network became faster.
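To make the Mean Squared Error metric mentioned above concrete, here is a minimal sketch of how it could be computed for one predicted bounding box. The box format (x_min, y_min, x_max, y_max, normalized to the range 0..1) and the sample values are assumptions made for illustration; they are not taken from the project.

```python
def mean_squared_error(predicted, target):
    """Average of the squared differences between two equal-length sequences."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical boxes as (x_min, y_min, x_max, y_max), normalized to [0, 1].
true_box = (0.20, 0.40, 0.60, 0.50)
predicted_box = (0.25, 0.40, 0.55, 0.50)

# Two coordinates are off by 0.05 each, the other two match exactly.
loss = mean_squared_error(predicted_box, true_box)
print(f"MSE for this box: {loss:.5f}")
```

A lower value means the predicted box lies closer to the labeled one, which is why a drop in this number can be read as an improvement of the detector.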

With character detection solved, character recognition remained. Character recognition was not a new task for our team: we trained a convolutional neural network similar in architecture to LeNet and got 99% accuracy on the test dataset (i.e. the network made almost no mistakes). In practice, however, this solution turned out to be unacceptable, because the network requires a fixed-resolution input, while real pictures arrive in different resolutions. Our first idea was to train the same network on pictures of a resolution so large that no bigger picture could plausibly reach it. We abandoned this idea, because it meant changing the resolution of images fed to an architecture that is not designed for that. So we had to rethink the recognition architecture, and we decided that the best solution was to build on ResNet from Microsoft, a successor to LeNet-style networks, since this architecture can accept images of different resolutions.
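One reason a ResNet-style network can handle inputs of different resolutions is that it ends with global average pooling, which reduces every feature map to a single number regardless of the map's size. The sketch below illustrates this idea in plain Python; the function name and the toy feature maps are made up for the example, and a real model would use a deep learning library rather than nested lists.

```python
def global_average_pool(feature_maps):
    """Collapse each 2-D feature map (a list of rows) to its mean value."""
    pooled = []
    for feature_map in feature_maps:
        values = [value for row in feature_map for value in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two channels of size 2x2, as if produced from a small input image.
small = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]

# Two channels of size 3x4, as if produced from a larger input image.
large = [
    [[1.0] * 4 for _ in range(3)],
    [[2.0] * 4 for _ in range(3)],
]

# Both results have one value per channel, so the classifier layer that
# follows always receives a vector of the same length.
print(global_average_pool(small))
print(global_average_pool(large))
```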

After training this network we got 95% accuracy. That is a little lower than with the LeNet architecture, but it works. Then we remembered that we did not need to recognize all letters, only digits, so we decided to add one more model before the recognizer to determine whether a symbol is a digit or a letter. Since this model solves a classification problem with only two classes, a neural network would have been overkill, so we based it on logistic regression. After adding this model to the “ensemble” and retraining the neural network on digits only, our accuracy was about 97%, which is not bad. On this note we decided to finish work on character recognition, and we ended up with the following “ensemble”:

Original picture -> Convolutional neural network that locates words -> Convolutional neural network that locates letters -> Logistic regression that decides whether a symbol is a letter or a digit -> ResNet that determines the final digit
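The cascade above can be sketched as a chain of four stages passed in as functions. This is only an illustration of the data flow: the function name `read_room_numbers` and its parameters are hypothetical, and the stand-in stages at the bottom operate on a plain string instead of a picture so that the sketch can run without any trained models.

```python
from typing import Callable, List


def read_room_numbers(
    image,
    find_words: Callable,
    find_chars: Callable,
    is_digit: Callable,
    classify_digit: Callable,
) -> List[str]:
    """Run the four-stage cascade and return every digit string found."""
    numbers = []
    for word in find_words(image):        # stage 1: locate words
        chars = find_chars(word)          # stage 2: locate characters
        # stage 3 filters out letters, stage 4 reads the remaining digits
        digits = [classify_digit(c) for c in chars if is_digit(c)]
        if digits:
            numbers.append("".join(digits))
    return numbers


# Toy stand-ins: a string plays the role of the picture of a door plate.
result = read_room_numbers(
    "Room 339",
    find_words=str.split,
    find_chars=list,
    is_digit=str.isdigit,
    classify_digit=lambda c: c,
)
print(result)
```

Passing the stages in as arguments keeps each model replaceable, which matches how the team swapped architectures several times while keeping the overall chain.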

But then we had one more problem: tracking the user's location. At first we thought this was one of the easiest tasks, and we turned out to be wrong. Like real programmers, we first took the path of least resistance and simply tracked the GPS position inside the building. It didn't work: the error was far too large for us. We looked for other ways to determine location and decided to connect Yandex Latitude, a service from Yandex, but it turned out to be no longer working, so we could not connect to its API. Our next idea was to use the Mapbox framework for the Unity engine (the engine we had chosen for the project). After we connected the framework, explored its capabilities and wrote a simple GPS tracker, the result was better, but the error was still large.

We kept searching for solutions, and after carefully reading the documentation we found that the obtained value had an error of 15%, which did not suit us. This approach also had a very big drawback: it relied on GPS. We are building a navigator for a school, that is, for indoor use, where the GPS signal is either very weak or absent altogether, and GPS is not supported on all devices. That is why we changed the approach and moved from global positioning to local positioning based on a Visual Positioning System (VPS). This method has its drawbacks too, but they did not affect our work, so the location tracking system was ready.

Practical significance

Our application helps people who are visiting for the first time, or who simply do not know the layout of our school well, to find their way around. At the entrance they will find a stand with a QR code; scanning it downloads the application.

Functionality

What our application can do:

recognize a room number from its door plate,
show a detailed school map,
guide a person to the room they need,
search for a room by its number,
search for a classroom in a list by its type (all classrooms are grouped by the subject taught in them).
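The two search features in this list could be backed by a very small data structure. The sketch below is an assumption about how such a directory might look, not the application's actual code: the `Room` class, the helper functions and all rooms except room 339 (the informatics classroom used as the example later in this article) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Room:
    number: str
    subject: str


# Hypothetical directory; only room 339 comes from the article.
ROOMS = [
    Room("339", "Informatics"),
    Room("215", "Physics"),
    Room("118", "Informatics"),
]


def find_by_number(rooms, number):
    """Return the room with the given number, or None if there is none."""
    return next((room for room in rooms if room.number == number), None)


def group_by_subject(rooms):
    """Map each subject to the list of room numbers where it is taught."""
    groups = {}
    for room in rooms:
        groups.setdefault(room.subject, []).append(room.number)
    return groups


print(find_by_number(ROOMS, "339"))
print(group_by_subject(ROOMS))
```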

Application implementation

The application was built on Unity, one of the most popular engines, using the C# programming language, together with Python and the TensorFlow library. In addition, the Anaconda package manager was used.

How to use this application?

Suppose I am a parent who has come to a parents' meeting and needs to get to room 339, the informatics classroom. I enter the school and, as mentioned above, scan the QR code from the tablet. Once the application has loaded, I arrive at the main menu:

After that, I can go to the instructions menu to find out how to use the application:

The instructions describe in detail how to find a classroom. First, in the main menu, we tap “Find Class”, and the following screens are displayed:

The arrow in the upper right corner takes us back to the main menu. If the system recognizes the room number, a frame appears around the door plate, followed by a confirmation window:

If the model has recognized the room correctly, our hypothetical “parent” taps “yes”; if the room number was not recognized correctly, they tap “no” and take the picture again. This is how the application learns where we are in the school.

Next, the “parent” chooses a destination. There are two options: enter the room number, or select the room type and then find the room in the list:

After that, the “parent” follows the signs shown over the camera view.

When you arrive at your destination, the navigator will notify you of this.
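How the route behind those signs is computed is not described here. One common way to do it, shown purely as an assumption, is to model the building as a graph of waypoints and run a breadth-first search from the recognized room to the destination. The waypoint names and the graph below are hypothetical.

```python
from collections import deque


def shortest_route(graph, start, goal):
    """Breadth-first search: return the route with the fewest hops, or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue  # already reached by a route that is no longer
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back through the predecessors to rebuild the route.
                route = [goal]
                while previous[route[-1]] is not None:
                    route.append(previous[route[-1]])
                return route[::-1]
            queue.append(neighbor)
    return None


# Hypothetical waypoints; each connection is listed in both directions.
SCHOOL = {
    "entrance": ["hall"],
    "hall": ["entrance", "stairs", "118"],
    "118": ["hall"],
    "stairs": ["hall", "corridor_3"],
    "corridor_3": ["stairs", "339"],
    "339": ["corridor_3"],
}

print(shortest_route(SCHOOL, "entrance", "339"))
```

Breadth-first search treats every connection as equally long; if corridor lengths mattered, a weighted algorithm such as Dijkstra's would be the natural replacement.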

There is also a more familiar option: in the main menu the “parent” taps “School Plan”, and a detailed plan opens that can be used to work out the way to the desired place independently.

Floor 1 plan

Floor 2 plan

For now, this is all the functionality the application has. In the future we plan to refine it and add more interesting features.

You can watch the video at the link below:

Conclusion

With well-coordinated teamwork, enough time and good soft skills, any project can be implemented.

Worked on the application:

Ilya Vasilenko - programmer (backend)

Parfenyev Demid - Designer

Lunev Daniil - Unity developer (frontend)

Mikhail Purtov - Data Miner, creator

Thanks to this project, each member of our team learned a lot of new things, from design to programming and computer vision.

Our PYC team believes that Habr will love our project.

Regards, PYC Team

Tools

Unity + C#

Python + Anaconda package manager + TensorFlow library for creating neural networks + scikit-learn library for creating machine learning models

Immersal - A library that implements VPS

Bibliography

TensorFlow website
Immersal website
C# website
Python website

* This publication is not intended as an advertisement for our application: it is not publicly available and works only within our school.
