OpenVINO Hackathon: Recognizing Voice and Emotion on the Raspberry Pi

On November 30 and December 1, an OpenVINO hackathon was held in Nizhny Novgorod. Participants were asked to build a prototype of a product solution using the Intel OpenVINO toolkit. The organizers proposed a list of sample topics to guide the choice of task, but the final decision was left to the teams. The use of models not included in the product was also encouraged.

In this article we describe how we built our product prototype, which eventually won first place.

More than 10 teams took part in the hackathon. It is nice that some of them came from other regions. The venue was the "Kremlin on Pochain" complex, decorated inside with old photographs of Nizhny Novgorod, which set the mood! (A reminder: at the moment, Intel's central office is located in Nizhny Novgorod.) Participants had 26 hours to write the code and then had to present their solution. A separate plus was the demo session, which made sure that everything conceived had really been implemented and had not remained an idea on a slide. Merch, snacks, food: all of that was there too!

In addition, Intel optionally provided cameras, Raspberry Pi boards, and Neural Compute Stick 2 devices.

Task selection


Raspberry Pi 3 with Intel NCS 2.

The NCS is designed to run inference for CNN models.

Audio is captured on the RPI with the Voice Bonnet from the Google AIY Voice Kit.

With the Raspbian image from AIY projects, the microphone can be checked by recording a short clip (5 seconds):

arecord -d 5 -r 16000 test.wav

The input level can be adjusted in alsamixer: select Capture devices and set the level to 50-60%.


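The AIY Voice Kit has an RGB button whose backlight can be controlled from code. Searching for "Google AIY Led" leads to the documentation: https://aiyprojects.readthedocs.io/en/latest/aiy.leds.html
We have 7 emotion classes and the button offers 8 colors, which is just enough!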

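The button is connected through GPIO to the Voice Bonnet, and the required libraries are loaded (they ship with the AIY projects image):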

from aiy.leds import Leds, Color
from aiy.leds import RgbLeds

We create a dict that maps each emotion to an RGB tuple, and an object of class aiy.leds.Leds through which the color is updated:

led_dict = {'neutral': (255, 255, 255), 'happy': (0, 255, 0), 'sad': (0, 255, 255), 'angry': (255, 0, 0), 'fearful': (0, 0, 0), 'disgusted': (255, 0, 255), 'surprised': (255, 255, 0)}
leds = Leds()

After each new prediction, the button color is updated to match the predicted emotion (looked up by key).

leds.update(Leds.rgb_on(led_dict.get(classes[prediction])))
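
Note that dict.get returns None for a key that is missing from the dict, and None is not a usable color. A minimal sketch of a lookup with an explicit fallback follows; the helper name `color_for` and the choice of "off" as the default are our assumptions, not part of aiy.leds:

```python
# Same mapping as led_dict above, repeated so the snippet runs on its own.
LED_COLORS = {
    'neutral': (255, 255, 255),
    'happy': (0, 255, 0),
    'sad': (0, 255, 255),
    'angry': (255, 0, 0),
    'fearful': (0, 0, 0),
    'disgusted': (255, 0, 255),
    'surprised': (255, 255, 0),
}
OFF = (0, 0, 0)  # assumed fallback: switch the LED off


def color_for(label, colors=LED_COLORS, default=OFF):
    """Return the RGB tuple for an emotion label, or the default for unknown labels."""
    return colors.get(label, default)
```

The result can then be passed on in the same way as before, for example `leds.update(Leds.rgb_on(color_for(classes[prediction])))`.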


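We use pyaudio to capture the stream from the microphone and webrtcvad to detect voice. Segments that contain voice are put into a queue, from which they are picked up for processing.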

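webrtcvad only accepts frames of 10/20/30 ms, and we record at 48 kHz, so we capture chunks of 48000×20/1000×1 (mono) = 960 samples. Webrtcvad returns True/False for each chunk, which corresponds to the presence or absence of voice in it.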
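
The chunk size follows directly from the sample rate and the frame length. A small sketch of that arithmetic (the helper name `frame_samples` is ours, not part of webrtcvad or pyaudio):

```python
def frame_samples(rate_hz, frame_ms, channels=1):
    """Number of samples in one frame of the given duration."""
    return rate_hz * frame_ms // 1000 * channels


CHUNK = frame_samples(48000, 20)  # samples in a 20 ms mono frame at 48 kHz
CHUNK_BYTES = CHUNK * 2           # 16-bit PCM uses 2 bytes per sample
```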

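Chunks that contain voice are appended to a list; for a chunk without voice, we increment a counter of empty chunks.
If the counter of empty chunks is >=30 (600 ms), we check the size of the accumulated list: if it is >250 chunks (5 s), the segment is put into the queue and the list is cleared.
If the counter is still < 30 but the list has grown beyond 300 chunks (6 s), the segment is put into the queue as well.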

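import queue

import numpy as np
import pyaudio
import webrtcvad

process = True  # assumed global stop flag: set to False to end the capture loop

def to_queue(frames):
    # Join the raw 16-bit PCM chunks into one int16 array
    return np.frombuffer(b''.join(frames), dtype=np.int16)

framesQueue = queue.Queue()

def framesThreadBody():
    CHUNK = 960  # 20 ms of mono audio at 48 kHz
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 48000

    p = pyaudio.PyAudio()
    vad = webrtcvad.Vad()
    vad.set_mode(2)
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    false_counter = 0  # consecutive chunks without voice
    audio_frame = []   # accumulated chunks with voice
    while process:
        data = stream.read(CHUNK)
        if vad.is_speech(data, RATE):
            false_counter = 0
            audio_frame.append(data)
            # Long utterance: flush once more than 300 chunks are collected
            if len(audio_frame) > 300:
                framesQueue.put(to_queue(audio_frame))
                audio_frame = []
        else:
            false_counter += 1
            # After 600 ms of silence, flush the buffer if it is long enough;
            # shorter buffers are kept and continue to accumulate
            if false_counter >= 30 and len(audio_frame) > 250:
                framesQueue.put(to_queue(audio_frame))
                audio_frame = []
                false_counter = 0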
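
The buffering rule of the capture loop can be tried out without a microphone. Below is a minimal sketch that replays the same thresholds (30 silent chunks, more than 250 or more than 300 voiced chunks) over a list of ready-made True/False decisions; the function `segment` and its parameter names are illustrative and do not come from the project:

```python
def segment(vad_flags, silence_limit=30, min_len=250, max_len=300):
    """Group per-chunk voice decisions into segments.

    vad_flags: iterable of booleans, True when a chunk contains voice.
    Returns the lengths (in chunks) of the segments that would be queued.
    """
    segments = []
    voiced = 0  # voiced chunks accumulated so far
    silent = 0  # consecutive chunks without voice
    for is_speech in vad_flags:
        if is_speech:
            silent = 0
            voiced += 1
            if voiced > max_len:  # long utterance: flush early
                segments.append(voiced)
                voiced = 0
        else:
            silent += 1
            if silent >= silence_limit and voiced > min_len:
                segments.append(voiced)
                voiced = 0
                silent = 0
    return segments
```

With 20 ms chunks, 260 voiced chunks followed by 30 silent ones produce a single 5.2-second segment, while a 2-second phrase followed by silence produces nothing.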

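Next we looked for pretrained models on github. Every candidate has to be tested and, on top of that, converted to the internal format of OpenVINO, IR (Intermediate Representation). We tried 5-7 different solutions from github.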

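Emotions from voice: https://github.com/alexmuhr/Voice_Emotion
It works as follows: the audio is cut into segments, MFCC features are extracted from each segment and fed to a CNN.
Speaker recognition from voice: https://github.com/linhdvu14/vggvox-speaker-identification
Here, instead of MFCC, we work with a spectrogram: after the FFT the signal is fed to a CNN, which outputs a vector representation of the voice.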

Next, a little about converting models. OpenVINO includes several modules:

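Open Model Zoo, a collection of models that can be used in your product,
Model Optimizer, which converts a model from popular framework formats (TensorFlow, ONNX, etc.) to the Intermediate Representation,
Inference Engine, which runs IR models on Intel hardware, including the Myriad chips of the Neural Compute Stick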

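OpenCV (with Inference Engine support)
Each model in IR format is described by two files: .xml and .bin.
Models are converted to IR with Model Optimizer as follows: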

python /opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py --input_model speaker.hdf5.pb --data_type=FP16 --input_shape [1,512,1000,1]

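--data_type selects the data format the model will work with. FP32, FP16, and INT8 are supported.
--input_shape sets the dimensions of the input data. The C++ API appears to allow changing them at run time, but here the shape is simply fixed at conversion.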
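
When several models have to be converted, the command can be assembled in a script. The sketch below only rebuilds the command line shown above from its parts; the helper name `mo_command` is hypothetical, and the default script path is the one from that command, so it may differ on another installation:

```python
def mo_command(model_path, data_type, input_shape,
               mo_script="/opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py"):
    """Build the Model Optimizer command line as a list of arguments."""
    # Model Optimizer expects the shape as a bracketed, comma-separated list
    shape = "[" + ",".join(str(dim) for dim in input_shape) + "]"
    return ["python", mo_script,
            "--input_model", model_path,
            "--data_type", data_type,
            "--input_shape", shape]


cmd = mo_command("speaker.hdf5.pb", "FP16", [1, 512, 1000, 1])
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where OpenVINO is installed.
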
The resulting IR model is then loaded into the DNN module of OpenCV, and a forward pass is run on it.

import cv2 as cv
emotionsNet = cv.dnn.readNet('emotions_model.bin',
                          'emotions_model.xml')
emotionsNet.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

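The last line redirects the computation to the Neural Compute Stick. By default the computation runs on the CPU, but on the Raspberry Pi that is not an option, so the stick is required.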

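The logic is as follows: we split the audio into windows of a fixed size (0.4 s in our case), convert each window to MFCC, and feed the result to the network: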

emotionsNet.setInput(MFCC_from_window)
result = emotionsNet.forward()
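
The windowing step can be illustrated separately from the network. The sketch below assumes that the window size of 0.4 is given in seconds and that windows do not overlap; the MFCC computation is left out, and the name `split_windows` is ours:

```python
def split_windows(samples, rate_hz, window_s=0.4):
    """Split a sequence of samples into consecutive non-overlapping windows.

    A trailing piece shorter than one window is dropped.
    """
    size = int(round(rate_hz * window_s))
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]


# One second of dummy audio at 48 kHz yields two full 0.4 s windows
windows = split_windows(list(range(48000)), 48000)
```

Each window would then be converted to MFCC and passed to `emotionsNet.setInput`.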

For speaker recognition we first need a database of voices, so we record a sample for each speaker:

python3 voice_db/record_voice.py test.wav

Each recording is then processed with a fast Fourier transform, and the resulting spectrogram is saved as a numpy array (.npy):

for file in glob.glob("voice_db/*.wav"):
    spec = get_fft_spectrum(file)
    np.save(file[:-4] + '.npy', spec)

See create_base.py for details.
The embeddings for the saved spectrograms are then computed:

for file in glob.glob("voice_db/*.npy"):
    spec = np.load(file)
    spec = spec.astype('float32')
    spec_reshaped = spec.reshape(1, 1, spec.shape[0], spec.shape[1])
    srNet.setInput(spec_reshaped)
    pred = srNet.forward()
    emb = np.squeeze(pred)

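After obtaining the embedding for a recorded segment, we can determine whose voice it is by the cosine distance to each voice in the database (the smaller the distance, the more likely the match); for the demo we set the threshold to 0.3: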

dist_list = cdist(emb, enroll_embs, metric="cosine")
distances = pd.DataFrame(dist_list, columns = df.speaker)
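
To make the matching rule concrete, here is a self-contained sketch in plain Python, without scipy or pandas. The function names, the dict of enrolled speakers, and the use of "less than or equal" for the 0.3 threshold are assumptions made for illustration:

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm  # assumes neither vector is all zeros


def identify(emb, enrolled, threshold=0.3):
    """Return the closest enrolled speaker, or None if nobody is within the threshold."""
    name, dist = min(((n, cosine_distance(emb, e)) for n, e in enrolled.items()),
                     key=lambda pair: pair[1])
    return name if dist <= threshold else None


# Toy 2-dimensional "embeddings"; real ones come from the network
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
```

For example, `identify([0.9, 0.1], enrolled)` picks the first speaker, while a vector pointing away from both enrolled voices gives None.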

Inference turned out to be fast enough to leave room for 1-2 more models (a 7-second sample took 2.5 seconds).

Web application


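The back end is a message channel between the front end and the Raspberry Pi, built on websocket (a protocol that starts with an HTTP handshake and runs over TCP).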

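The Raspberry Pi sends its results to the server as json messages. The server is written in golang.
Connected clients are registered in a hub, and messages are distributed to them through the hub.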
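
The server itself is written in golang. Purely to illustrate the hub idea in the same language as the other snippets, here is an in-memory Python sketch; the class and method names are hypothetical, and real websocket connections are replaced by plain lists:

```python
class Hub:
    """Keeps track of connected clients and fans messages out to them."""

    def __init__(self):
        self.clients = {}  # client id -> list standing in for that client's connection

    def register(self, client_id):
        self.clients[client_id] = []

    def unregister(self, client_id):
        # Ignore ids that are not registered
        self.clients.pop(client_id, None)

    def broadcast(self, message):
        # Deliver the message to every currently registered client
        for inbox in self.clients.values():
            inbox.append(message)
```

A client that has been unregistered no longer receives later broadcasts.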

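The front end is a web application written in JavaScript with the React library. It visualizes the results that arrive from the back end and the Raspberry Pi. Pages are switched with react-router, and the data comes in over WebSocket. The Raspberry Pi sends the predicted class together with its probability.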
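
A message of this kind could look as follows. This is only a sketch: the field names `emotion`, `probability`, and `speaker` are invented for illustration and are not taken from the project:

```python
import json


def encode_prediction(emotion, probability, speaker=None):
    """Serialize one prediction as a json string (hypothetical field names)."""
    return json.dumps({
        "emotion": emotion,
        "probability": round(probability, 3),
        "speaker": speaker,
    })


def decode_prediction(payload):
    """Parse a json string produced by encode_prediction."""
    msg = json.loads(payload)
    return msg["emotion"], msg["probability"], msg.get("speaker")
```

On the receiving side, `decode_prediction` returns the fields as a tuple that can be passed on to the visualization.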


In total, the hardware costs about $150:

Raspberry Pi 3: ~$35
Google AIY Voice Bonnet (or a respeaker): ~$15
Intel NCS 2: ~$100


Repository: https://github.com/vladimirwest/OpenEMO

