How we test microphone systems on STM32: the experience of Yandex device developers

Hi, I’m Gennady “Crail” Kruglov from the Yandex hardware solutions team.

Choosing microphones for a microphone array is a complex and interesting part of our work: we test models with different parameters, experiment with different array configurations, and improve our sound processing algorithms.

Developers who write echo cancellation and noise reduction algorithms find it convenient not only to process raw data recorded earlier from a device in the lab, but also to work with, say, a new microphone array in real time by plugging it into their laptop.

It only looks simple at first glance. In this article I will explain how we solved the problem of transferring sound from seven microphones with a PDM interface to a computer over USB, which hardware and software pitfalls we ran into, and how we got around them (spoiler: the approach can be adapted to arrays of up to 8 microphones). At the end of the post I will share a link to the stream where I show the development process on the STM32 microcontroller, and tell you about the next episode.

Formulation of the problem

A little background: to form a steerable sensitivity beam, the first Yandex.Station used a circuit with seven (analog) microphones, and the Mini version used four (digital ones by then). Other configurations are being considered for other products, but the seven-microphone array remains our basic, classic one.

So, given: seven digital microphones that need to be tested. Find: a flexible way to interact with them that is not too hard to implement. It is logical to split the task in two:

1. Get data from microphones.

2. Send them to a computer.

In the finished device, when the user talks to Alice, the signals from the digital microphones go straight to the central processor (strictly speaking it is an SoC, a System-on-Chip, but "processor" is more familiar and convenient), which has enough power to process them. For debugging algorithms, though, it is much more convenient to get this data directly onto the developer's computer. The easiest way is to connect over USB, so the board needs a microcontroller with a USB block. We love the STM32, but the audio stream from the microphones cannot be fed into it directly: it has no block for receiving PDM (pulse density modulation), the output interface of digital microphones.

Another option is to connect the microphone board to a debug board from the manufacturer of the SoC we use. But that solution is tied to the Linux alsamixer, whose settings strongly affect the result of the PDM-to-PCM conversion. The conversion blocks can differ not only between processors from different manufacturers, but even between two models from the same vendor. As a reminder, we needed a simple, transparent and predictable solution.

Hardware solution

Let us accept that the STM32 cannot receive multi-channel PDM. One could use an SPI block to receive a PDM signal, but only one microphone can be connected to one SPI bus, and the STM32L476RC controller we work with has only three such buses. An additional difficulty: the PDM signal is fairly high-frequency and has to be decimated, averaged and filtered, which is quite a heavy task for seven microphones.
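To give a feel for what "decimation and averaging" means, here is a minimal Python sketch. It is not what the microphones or any converter chip actually run: the first-order modulator, the plain block averaging and the decimation ratio of 64 (a 1.024 MHz PDM clock for 16 kHz audio) are all assumptions chosen for illustration. Real converters typically use multi-stage filters rather than a simple average.

```python
import math

def pdm_encode(samples):
    """First-order sigma-delta modulator: floats in [-1, 1] -> list of 0/1 bits."""
    bits, acc, feedback = [], 0.0, 0.0
    for x in samples:
        acc += x - feedback                     # integrate the error against the last output
        feedback = 1.0 if acc >= 0 else -1.0
        bits.append(1 if feedback > 0 else 0)
    return bits

def pdm_decimate(bits, ratio=64):
    """Boxcar decimation: average each block of `ratio` bits into one PCM value."""
    pcm = []
    for i in range(0, len(bits) - ratio + 1, ratio):
        ones = sum(bits[i:i + ratio])
        pcm.append(2.0 * ones / ratio - 1.0)    # map bit density 0..1 to -1..+1
    return pcm

# A 1 kHz tone at an assumed PDM clock of 1.024 MHz (64 x 16 kHz), 10 ms long.
PDM_RATE = 1_024_000
tone = [0.5 * math.sin(2 * math.pi * 1000 * n / PDM_RATE) for n in range(PDM_RATE // 100)]
pcm = pdm_decimate(pdm_encode(tone))
print(len(pcm), "PCM samples from", len(tone), "PDM bits")
```

Even this toy version has to touch every one of roughly a million bits per second per microphone, which is why doing it in software for seven microphones is unattractive.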

Since this is a debug board and not a prototype for mass production, we settled on a specialized chip, the TSDP18xx. It does everything needed: it generates the required clocks and signals for PDM, averages and processes the PDM signal, and turns it all into an I2S signal. More precisely, into TDM (Time Division Multiplexing), because the I2S bus assumes two channels, and once you push more channels through the same wires it is no longer quite correct to call it I2S.

The advantage of this approach is that the TSDP takes over all the preparation and averaging work. The downside is that all the algorithms are hard-wired inside the chip and cannot be changed. In particular, you cannot adjust the volume by changing the averaging parameters. For debugging, though, this is not critical.

Keep an eye on the numbers: there are seven microphones but eight channels on the chip. The unused channel is still present in the output, so from here on I will talk about an eight-channel audio stream for simplicity.

So, we bring up eight-channel TDM on the STM32 and get an eight-channel audio stream.

SAI is the STM32 hardware block for working with I2S / TDM. It is very flexible and lets you implement many protocol variants, but for the same reason it is easy to get confused by its clock frequency requirements.

The clock tree deserves a closer look. A 12 MHz crystal is connected to the microcontroller. Before feeding this frequency to the PLL blocks we divide it by 3 and get 4 MHz. Then it works like this:

1. It makes sense to raise the core frequency so that it keeps up with everything: the maximum for this controller is 80 MHz. We use the first PLL block: multiply 4 MHz by 40 and divide by 2.

2. USB requires 48 MHz. To do this, use the second PLL block: multiply 4 MHz by 24 and divide by 2.

3. Now the microphones. Our test boards use a sampling frequency of Fs = 16 kHz, a standard in speech recognition. From the initial 4 MHz we need to derive something that can be turned into a 16 kHz frame clock on the TDM bus (aka LRCK, aka FCK, aka FrameSync). Here:

[frequency of bit synchronization (BCLK, BitClk, Sync, SCK)] = Fs ∙ [number of channels] ∙ [number of bits per channel]

That is: SCK = 16 kHz ∙ 8 ∙ 16 = 2048 kHz.

4. The datasheet states the relationship between the Master Clock and the sampling rate Fs: MasterClock = 16 kHz ∙ [MCLK divider] ∙ 256. Here 256 is a constant, and the divider is set in a register. According to the clock scheme, the PLL output that provides this function can only be divided by 7 or 17.
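The arithmetic from the first three steps is easy to check with a few lines of Python. This is only a calculation on the numbers given in the text; the variable names are made up for the example and do not correspond to any register or library names.

```python
CRYSTAL_HZ = 12_000_000
PLL_INPUT_HZ = CRYSTAL_HZ // 3           # 4 MHz after the pre-divider

core_hz = PLL_INPUT_HZ * 40 // 2         # first PLL: multiply by 40, divide by 2
usb_hz = PLL_INPUT_HZ * 24 // 2          # second PLL: multiply by 24, divide by 2

FS_HZ = 16_000
CHANNELS = 8
BITS_PER_CHANNEL = 16
bit_clock_hz = FS_HZ * CHANNELS * BITS_PER_CHANNEL   # SCK on the TDM bus

print(f"core: {core_hz / 1e6:.0f} MHz, USB: {usb_hz / 1e6:.0f} MHz, SCK: {bit_clock_hz / 1e3:.0f} kHz")
```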

To summarize the problem: we need a set of PLL and SAI multipliers and dividers that yields a sampling frequency of 16 kHz and a bit clock 128 times higher. Because the chain contains a mandatory division by 7 (or 17), the exact value could not be reached. I had to build a table of multipliers and dividers and settled on 24.571 MHz. Dividing this frequency by 6 (the MCLK divider) and then by 256 (the constant) gives a value close enough to 16 kHz. Now I will explain why this matters so much.
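Such a table can be built by brute force. The sketch below makes some assumptions that are not stated in the article: that the PLL multiplier is an integer between 8 and 86, and that the MCLK divider is fixed at 6. The 4 MHz input, the dividers 7 and 17 and the constant 256 come from the text.

```python
PLL_INPUT_HZ = 4_000_000      # 12 MHz crystal divided by 3
MCLK_DIVIDER = 6              # from the text
MCLK_CONSTANT = 256           # from the text
TARGET_FS_HZ = 16_000

def sampling_rate(n, p):
    """Fs in Hz for PLL multiplier n and PLL output divider p."""
    return PLL_INPUT_HZ * n / p / MCLK_DIVIDER / MCLK_CONSTANT

def best_pll_setting(n_range=range(8, 87), p_values=(7, 17)):
    """Return the (n, p) pair whose sampling rate is closest to the target."""
    candidates = [(n, p) for n in n_range for p in p_values]
    return min(candidates, key=lambda s: abs(sampling_rate(*s) - TARGET_FS_HZ))

n, p = best_pll_setting()
fs = sampling_rate(n, p)
pll_mhz = PLL_INPUT_HZ * n / p / 1e6
print(f"N={n}, P={p}: PLL output {pll_mhz:.3f} MHz, Fs = {fs:.2f} Hz")
```

Under these assumptions the search picks a multiplier of 43 with the divider 7, which gives the 24.571 MHz mentioned above and a sampling rate about 3 Hz below 16 kHz.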

USB operation

USB uses the isochronous transfer type for multimedia data: it guarantees a certain bandwidth and latency on the bus. Delivery, however, is not guaranteed: if a packet arrives corrupted, it is considered lost. This follows from the strict timing: there is no opportunity to ask for it again.

With isochronous transfers at USB Full Speed (12 Mbit/s, the speed the STM32 USB block works at) the computer comes to the device for data every millisecond and collects whatever has accumulated over that period. Recall the inputs: a sampling frequency of 16 kHz, 8 channels, and two bytes per channel, because the audio is 16-bit. In total, 16000 ∙ 2 ∙ 8 / 1000 = 256 bytes per millisecond. A single isochronous packet can be up to 1023 bytes, so there is no problem here.
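The same calculation, plus a host-side helper that splits one packet back into channels, can be sketched in Python. The byte order (little-endian) and the channel order inside a frame are assumptions here, as is the function name `deinterleave`; the article does not specify them.

```python
import struct

FS_HZ = 16_000
CHANNELS = 8
BYTES_PER_SAMPLE = 2
FRAME_BYTES = CHANNELS * BYTES_PER_SAMPLE             # one sample from every channel
NOMINAL_PACKET_BYTES = FS_HZ * FRAME_BYTES // 1000    # data produced per 1 ms USB frame
MAX_ISO_PACKET_BYTES = 1023                           # Full Speed isochronous limit

def deinterleave(packet: bytes):
    """Split one packet into CHANNELS lists of signed 16-bit samples."""
    if len(packet) % FRAME_BYTES:
        raise ValueError("packet must hold whole frames")
    count = len(packet) // BYTES_PER_SAMPLE
    samples = struct.unpack(f"<{count}h", packet)      # little-endian is an assumption
    return [list(samples[ch::CHANNELS]) for ch in range(CHANNELS)]

# Fake packet for the demo: frame i carries the value 100*i + channel index in each slot.
packet = b"".join(struct.pack("<8h", *(100 * i + ch for ch in range(8))) for i in range(16))
channels = deinterleave(packet)
print(NOMINAL_PACKET_BYTES, len(packet), channels[3][:4])
```

Because the helper only requires a whole number of frames, it works for any packet length that is a multiple of 16 bytes, not just the nominal 256.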

So, the packet size is 256 bytes, and it would seem that all is well: receive data over TDM sixteen times, put it into the buffer, USB comes, we hand over the packet, repeat... But that only happens in an ideal world. On one side we have an imperfect 16 kHz (slightly less), so a full packet's worth of data arrives slightly less often than once per millisecond. On the other side, the computer's millisecond also floats, because the computer is busy: it comes when it can. In other words, the microphone sampling frequency differs from 16 kHz (by a constant amount), and the USB millisecond also varies in length (most likely the deviation floats: sometimes a little longer, sometimes a little shorter than an ideal millisecond).

Why is this a problem? You can lose a packet, and it hardly needs explaining that complete data is essential for debugging the algorithms correctly. Here is how a packet gets lost: we accumulate 256 bytes of results, put them in the buffer, and keep measuring. The computer comes and takes the first 256 bytes while we keep measuring. The computer comes again, but the measurement is not finished yet, so it leaves with an empty packet. We then finish filling that buffer and start filling the next one before the computer arrives again. The computer takes only the latest packet, so one packet is lost.
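To put a number on "a little less", here is a rough estimate in Python. It assumes the host polls at exactly 1 kHz (which, as noted, it does not) and uses the rounded 24.571 MHz PLL output from the text, so treat the result as an order of magnitude only.

```python
NOMINAL_FS_HZ = 16_000.0
ACTUAL_FS_HZ = 24.571e6 / 6 / 256      # PLL output, MCLK divider and constant from the text
FRAMES_PER_PACKET = 16                 # one nominal 256-byte packet

shortfall_per_s = NOMINAL_FS_HZ - ACTUAL_FS_HZ           # frames missing every second
seconds_per_missed_packet = FRAMES_PER_PACKET / shortfall_per_s
ppm_low = shortfall_per_s / NOMINAL_FS_HZ * 1e6

print(f"actual Fs = {ACTUAL_FS_HZ:.1f} Hz ({ppm_low:.0f} ppm low)")
print(f"the device falls one packet behind every {seconds_per_missed_packet:.1f} s")
```

So even with a perfectly punctual host, a scheme with fixed 256-byte packets would come up short by one packet roughly every five seconds.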

The problem is, in fact, known. There are three approaches to dealing with it:

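1. Synchronous. The device locks its sampling clock to USB, that is, to the start-of-frame signal that the host sends every millisecond.

2. Adaptive. The device adjusts its rate to the rate of the other side of the data stream.

3. Asynchronous. This is the best fit for our task. The device has its own stable clock generator, and the sampling rate is held exactly constant with no reference to USB. The data then has to be transferred in a way that keeps the two sides from drifting significantly apart.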

All of this has been discussed many times online for the case of playback from a computer to a speaker through a device with a digital-to-analog converter, where the device reports back, as feedback, how many sampling periods have elapsed since the last packet was received.

Our task is the opposite: debugging requires getting data from the microphones to a computer, and recording from microphones to a computer is mentioned only in passing, at best. Why not do the same and introduce feedback from the computer? There is an easier option.

Here it is

We add samples frequently and use two buffers to hold the data to be sent. Sixteen times per millisecond we add the next sample to the currently selected buffer. At some point an interrupt fires: USB has taken the previous packet. If buffer No. 1 was being filled, we switch to buffer No. 2. When USB comes for the next packet, it is already prepared. Then we send buffer No. 2 and switch back to No. 1.

USB comes for data at varying moments, so a packet contains a varying number of samples. It can be more or fewer than sixteen, which means a packet can exceed 256 bytes, so it is better to leave some room for maneuver. Let it be 384 = 256 + 128: this gives a margin of half a millisecond, that is, it tolerates the USB timing drifting by up to 50%, which should be more than enough. The result: sometimes more and sometimes fewer than 256 bytes are sent, but never an empty packet, which avoids data loss. In other words, we solved the unevenness problem by enlarging the packet, at the cost of reserving a larger share of the bus bandwidth for our device and leaving less for other devices.

That completes the delivery of data to the computer. Developers can get on with debugging, and you can ask questions in the comments if some data packet went missing on the way to a complete understanding.
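The scheme can be tried out in a small Python model before touching any firmware. This is a simulation under assumed numbers, not the actual STM32 code: the device rate of 15997 Hz, the host jitter of ±0.2 ms and all names are made up for the example. The model also shows one detail that follows from the scheme: the very first packet is empty, because nothing has been prepared yet.

```python
import random

BYTES_PER_FRAME = 8 * 2       # 8 channels x 16 bits
MAX_PACKET_BYTES = 384        # 256 + 128 bytes of headroom, as in the text
DEVICE_FS_HZ = 15_997.0       # assumed "slightly below 16 kHz" sampling rate

def simulate(polls=10_000, jitter_ms=0.2, seed=1):
    """Return (packet sizes in bytes, frames produced, frames still armed)."""
    rng = random.Random(seed)
    filling = 0        # frames in the buffer currently being filled
    armed = 0          # frames in the buffer prepared for the next USB transfer
    produced = 0
    packets = []
    for k in range(1, polls + 1):
        # The host shows up roughly once per millisecond, a bit early or late.
        t_poll = (k + rng.uniform(-jitter_ms, jitter_ms)) / 1000.0
        total = int(t_poll * DEVICE_FS_HZ)        # frames received over TDM so far
        filling += total - produced
        produced = total
        packets.append(armed * BYTES_PER_FRAME)   # host collects the prepared buffer
        armed, filling = filling, 0               # "packet taken" interrupt: swap buffers
    return packets, produced, armed

packets, produced, armed = simulate()
sent_frames = sum(packets) // BYTES_PER_FRAME
print(f"packet sizes after start-up: {min(packets[1:])}..{max(packets)} bytes")
print(f"frames produced: {produced}, sent: {sent_frames}, waiting in the armed buffer: {armed}")
```

In this model every frame is either sent or still waiting in the prepared buffer, and with this jitter no packet outgrows the 384-byte limit.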

My streams and the next episode

Recently I streamed twice from my home soldering lab. In the first stream I simply showed the soldering process and talked about the tools I use. The second episode was devoted to development on the STM32.

The streams continue. This Friday at 19:00 my colleague from the hardware solutions team, Andrey Laptev, will do an online teardown of the Yandex.Station Mini: he will show its insides and share stories from production. For extra fun, Andrey will attach a battery to the speaker so that it no longer has to run off a cable. At the end you will receive a guide that lets you repeat the experiment yourself or come up with a more interesting design.

Sign up to watch the stream. You will receive an email with a calendar file and a reminder on the day of the broadcast. Thank you for reading!
