OpenVINO Hackathon: Recognizing Voice and Emotion on the Raspberry Pi

From November 30 to December 1, the OpenVINO Hackathon was held in Nizhny Novgorod. Participants were asked to build a prototype of a product solution using the Intel OpenVINO toolkit. The organizers proposed a list of sample topics to guide the choice of a task, but the final decision was left to the teams. In addition, the use of models that are not included in the product was encouraged.

In this article we describe how we built our product prototype, which eventually won first place.

10 . , . “ ”, , ! (, Intel ). 26 , . -, , , . , , , !

, Intel , Raspberry PI, Neural Compute Stick 2.

. -, , , .

, , , . , OpenVINO, , . — . . , OpenVINO , , :

, , , , .
, , , .

: retail . . - — .
, , . , , , !

Raspberry Pi 3 with Intel NCS 2.

NCS — CNN , , ̶̶̶̶̶̶̶ ̶̶ ̶̶̶̶̶̶̶ .

: . USB-, RPI. “ ”. Voice Bonnet Google AIY Voice Kit, .

Raspbian AIY projects , , ( 5 ):

arecord -d 5 -r 16000 test.wav

, . , alsamixer, Capture devices 50-60%.

-

AIY Voice Kit , RGB-, . “Google AIY Led” : https://aiyprojects.readthedocs.io/en/latest/aiy.leds.html
, 7 , 8 , !

GPIO Voice Bonnet, ( AIY projects)

from aiy.leds import Leds, Color
from aiy.leds import RgbLeds

C dict, RGB Tuple aiy.leds.Leds, :

led_dict = {'neutral': (255, 255, 255), 'happy': (0, 255, 0), 'sad': (0, 255, 255),
            'angry': (255, 0, 0), 'fearful': (0, 0, 0), 'disgusted': (255, 0, 255),
            'surprised': (255, 255, 0)}
leds = Leds()

, , ( ).

leds.update(Leds.rgb_on(led_dict.get(classes[prediction])))
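The lookup above can be tried without the Voice Bonnet attached. The following is a minimal sketch: the color table is copied from the snippet above, while the order of `classes`, the helper name `color_for`, and the fallback to "LEDs off" for unknown classes are assumptions made for illustration.

```python
# Emotion-to-color table as given in the article.
led_dict = {'neutral': (255, 255, 255), 'happy': (0, 255, 0), 'sad': (0, 255, 255),
            'angry': (255, 0, 0), 'fearful': (0, 0, 0), 'disgusted': (255, 0, 255),
            'surprised': (255, 255, 0)}

# Hypothetical class order; the real order depends on how the model was trained.
classes = ['neutral', 'happy', 'sad', 'angry', 'fearful', 'disgusted', 'surprised']


def color_for(prediction, default=(0, 0, 0)):
    """Return the RGB tuple for a predicted class index, or `default` if unknown."""
    if not 0 <= prediction < len(classes):
        return default
    return led_dict.get(classes[prediction], default)


print(color_for(1))   # color for 'happy'
print(color_for(99))  # out-of-range index falls back to the default
```

On the device the result would be passed on as `leds.update(Leds.rgb_on(color_for(prediction)))`. The explicit default avoids handing `None` to `Leds.rgb_on` when a class name is missing from the table.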

, !

pyaudio webrtcvad . , , .

webrtcvad — 10/20/30, ( ) 48, 48000×20/1000×1()=960 . Webrtcvad True/False , .
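The frame-size arithmetic above (48000×20/1000×1 = 960) can be checked with a few lines of Python. This is a small sketch; the helper name `frame_size` is made up for illustration. It relies on the documented webrtcvad constraints: 16-bit mono PCM, frames of 10, 20 or 30 ms, and sample rates of 8000, 16000, 32000 or 48000 Hz.

```python
# Frame durations (ms) and sample rates (Hz) accepted by webrtcvad.
VALID_FRAME_MS = (10, 20, 30)
VALID_RATES = (8000, 16000, 32000, 48000)


def frame_size(rate, frame_ms, channels=1, sample_width=2):
    """Return (samples, bytes) for one VAD frame of 16-bit PCM audio."""
    if rate not in VALID_RATES:
        raise ValueError("unsupported sample rate: %d" % rate)
    if frame_ms not in VALID_FRAME_MS:
        raise ValueError("unsupported frame duration: %d ms" % frame_ms)
    # Same formula as in the text: rate * duration / 1000 * channels.
    samples = rate * frame_ms // 1000 * channels
    return samples, samples * sample_width


samples, n_bytes = frame_size(48000, 20)
print(samples, n_bytes)  # 960 1920
```

The sample count is what goes into `frames_per_buffer`; the byte count is the length of the buffer that `stream.read` returns and that `vad.is_speech` expects.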

list , , , .
>=30 (600 ), , >250, , , , , .
< 30, 300, . ( )

import queue

import numpy as np
import pyaudio
import webrtcvad

process = True  # assumed module-level flag; set to False to stop the capture thread
framesQueue = queue.Queue()


def to_queue(frames):
    # Join the raw byte frames into a single int16 sample array.
    return np.frombuffer(b''.join(frames), dtype=np.int16)


def framesThreadBody():
    CHUNK = 960  # 20 ms of mono audio at 48 kHz
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 48000

    p = pyaudio.PyAudio()
    vad = webrtcvad.Vad()
    vad.set_mode(2)
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    false_counter = 0  # consecutive frames without speech
    audio_frame = []   # buffered speech frames
    while process:
        data = stream.read(CHUNK)
        if vad.is_speech(data, RATE):
            false_counter = 0
            audio_frame.append(data)
            # Flush once the buffer exceeds 300 frames.
            if len(audio_frame) > 300:
                framesQueue.put(to_queue(audio_frame))
                audio_frame = []
        else:
            false_counter += 1
            # After 30 silent frames (600 ms), flush if enough speech was buffered.
            if false_counter >= 30 and len(audio_frame) > 250:
                framesQueue.put(to_queue(audio_frame))
                audio_frame = []
                false_counter = 0
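The buffering rules of the loop above can be exercised without a microphone by feeding in a sequence of VAD decisions instead of audio. The sketch below mirrors the thresholds from the code (30 silent frames, more than 250 buffered frames, forced flush above 300 frames); the function name `segment` and its parameters are illustrative only, and it yields buffer lengths rather than audio data.

```python
def segment(decisions, silence_frames=30, min_frames=250, max_frames=300):
    """Yield the length (in frames) of every buffer that would be queued."""
    buffered = 0  # speech frames collected so far
    silent = 0    # consecutive non-speech frames
    for is_speech in decisions:
        if is_speech:
            silent = 0
            buffered += 1
            # Forced flush when the buffer grows past max_frames.
            if buffered > max_frames:
                yield buffered
                buffered = 0
        else:
            silent += 1
            # Flush after a long enough pause, if enough speech was collected.
            if silent >= silence_frames and buffered > min_frames:
                yield buffered
                buffered = 0
                silent = 0


# 260 speech frames followed by a 30-frame pause: one buffer of 260 frames.
print(list(segment([True] * 260 + [False] * 30)))  # [260]
# 650 speech frames without a pause: forced flush every 301 frames.
print(list(segment([True] * 650)))                 # [301, 301]
# A short utterance (100 frames) is never queued by these rules.
print(list(segment([True] * 100 + [False] * 30)))  # []
```

At 20 ms per frame, 250 frames correspond to 5 seconds and 300 frames to 6 seconds of speech.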

, github, , , . , , , OpenVINO — IR (Intermediate Representation). 5-7 github, , — .

— https://github.com/alexmuhr/Voice_Emotion
: , MFCC CNN
— https://github.com/linhdvu14/vggvox-speaker-identification
MFCC , FFT CNN, .

, . OpenVINO :

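Open Model Zoo, a collection of pre-trained models,
Model Optimizer, which converts models from various frameworks (TensorFlow, ONNX, etc.) into the Intermediate Representation,
Inference Engine, which runs IR models on Intel devices, including the Myriad chip in the Neural Compute Stick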

OpenCV ( Inference Engine)
An IR model is described by two files: .xml and .bin.
Models are converted to IR with the Model Optimizer as follows:

python /opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py --input_model speaker.hdf5.pb --data_type=FP16 --input_shape [1,512,1000,1]

--data_type , . FP32, FP16, INT8. .
--input_shape . C++ API, .
IR DNN OpenCV forward .

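import cv2 as cv

# Load the IR model: .bin holds the weights, .xml the topology.
emotionsNet = cv.dnn.readNet('emotions_model.bin',
                             'emotions_model.xml')
# Run inference on the Myriad VPU of the Neural Compute Stick.
emotionsNet.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)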

Neural Compute Stick, , Raspberry Pi , .

: ( 0.4), MFCC, :

emotionsNet.setInput(MFCC_from_window)
result = emotionsNet.forward()
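Before features can be fed to the network window by window, the recorded signal has to be cut into windows. The sketch below shows one simple way to do that. The text mentions a value of 0.4 in this context; the sketch assumes, without confirmation from the source, that it is a window length in seconds and that windows do not overlap. The function name and parameters are hypothetical, and MFCC extraction itself is left out.

```python
def sliding_windows(signal, rate, window_s=0.4, step_s=0.4):
    """Split a 1-D signal into fixed-length windows; an incomplete tail is dropped."""
    size = int(round(window_s * rate))  # window length in samples
    step = int(round(step_s * rate))    # hop between window starts in samples
    if size <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    starts = range(0, len(signal) - size + 1, step)
    return [signal[s:s + size] for s in starts]


audio = [0] * 48000  # one second of silence at 48 kHz
windows = sliding_windows(audio, 48000)
print(len(windows), len(windows[0]))  # 2 19200
```

Each window would then be converted to MFCC features and passed to `emotionsNet.setInput` as in the snippet above. A smaller `step_s` gives overlapping windows.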

. , - , . , — . , . , .

, ( , , ).

python3 voice_db/record_voice.py test.wav

( )
fast fourier transform, numpy array (.npy):

# Save the FFT spectrum of every recording next to its .wav file.
for file in glob.glob("voice_db/*.wav"):
    spec = get_fft_spectrum(file)
    np.save(file[:-4] + '.npy', spec)

create_base.py
:

for file in glob.glob("voice_db/*.npy"):
    spec = np.load(file)
    spec = spec.astype('float32')
    spec_reshaped = spec.reshape(1, 1, spec.shape[0], spec.shape[1])
    srNet.setInput(spec_reshaped)
    pred = srNet.forward()
    emb = np.squeeze(pred)

, , cosine distance ( , ) — 0.3):

        dist_list = cdist(emb, enroll_embs, metric="cosine")
        distances = pd.DataFrame(dist_list, columns = df.speaker)
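To make the matching step concrete, here is a dependency-free sketch of speaker identification by cosine distance. It uses the same definition as the `cosine` metric of `cdist` (one minus the cosine similarity). The function names, the toy embeddings, and the reading of 0.3 as an acceptance threshold are assumptions for illustration, not the project's actual code.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def identify(emb, enrolled, threshold=0.3):
    """Return (name, distance) of the closest enrolled speaker.

    The name is None when even the closest speaker is farther than `threshold`.
    """
    name, dist = min(((n, cosine_distance(emb, e)) for n, e in enrolled.items()),
                     key=lambda pair: pair[1])
    return (name if dist <= threshold else None), dist


# Toy 3-dimensional embeddings; real embeddings come from the network.
enrolled = {'alice': [1.0, 0.0, 0.0], 'bob': [0.0, 1.0, 0.0]}
print(identify([0.9, 0.1, 0.0], enrolled)[0])  # alice
print(identify([0.0, 0.0, 1.0], enrolled)[0])  # None
```

Returning `None` above the threshold lets the caller treat the voice as an unknown speaker instead of forcing a match.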

, 1-2 ( 7 2.5). -.

-

: , .

Raspberry Pi, websocket (http over tcp protocol).

, json , , . , . golang, , , .
, . , hub, ( ), ( ), , hub.

Front-end web-, JavaScript React . , , back-end Raspberry Pi. , react-router, , WebSocket. Raspberry Pi , probability . , , , , .

, , , , . , , , . — , . , , , , .

, 150$:

Raspberry Pi 3 ~ 35$
Google AIY Voice Bonnet ( respeaker) ~ 15$
Intel NCS 2 ~ 100$

— ,
:
()

: https://github.com/vladimirwest/OpenEMO

. . . , , AI .
