OpenVINO Hackathon: Recognizing Voice and Emotion on Raspberry Pi

From November 30 to December 1, the OpenVINO hackathon was held in Nizhny Novgorod. Participants were asked to build a product prototype using the Intel OpenVINO toolkit. The organizers proposed a list of sample topics to guide the choice of task, but the final decision was left to the teams. In addition, the use of models not included in the product was encouraged.

In this article, we describe how we built our product prototype, with which we ultimately took first place.

10 . , . “ ”, , ! (, Intel ). 26 , . -, , , . , , , !

, Intel , Raspberry PI, Neural Compute Stick 2.

. -, , , .

, , , . , OpenVINO, , . — . . , OpenVINO , , :

, , , , .
, , , .

: retail . . - — .
, , . , , , !

Raspberry Pi 3 with Intel NCS 2.

NCS — CNN , , ̶̶̶̶̶̶̶ ̶̶ ̶̶̶̶̶̶̶ .

: . USB-, RPI. “ ”. Voice Bonnet Google AIY Voice Kit, .

Raspbian AIY projects , , ( 5 ):

arecord -d 5 -r 16000 test.wav

, . , alsamixer, Capture devices 50-60%.

-

AIY Voice Kit , RGB-, . “Google AIY Led” : https://aiyprojects.readthedocs.io/en/latest/aiy.leds.html
, 7 , 8 , !

GPIO Voice Bonnet, ( AIY projects)

from aiy.leds import Leds, Color
from aiy.leds import RgbLeds

Create a dict that maps each emotion to an RGB tuple, and an aiy.leds.Leds object through which the color will be updated:

led_dict = {'neutral': (255, 255, 255), 'happy': (0, 255, 0), 'sad': (0, 255, 255), 'angry': (255, 0, 0), 'fearful': (0, 0, 0), 'disgusted': (255, 0, 255), 'surprised': (255, 255, 0)}
leds = Leds()

, , ( ).

leds.update(Leds.rgb_on(led_dict.get(classes[prediction])))
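The lookup above can be wrapped so that an unexpected class index never breaks the LED update. This is a minimal sketch, not the project's code: the helper name `color_for` is hypothetical, and the order of labels in `CLASSES` is an assumption, since the article does not state how the model's outputs are ordered.

```python
# Emotion-to-colour table taken from the article.
LED_COLORS = {
    'neutral': (255, 255, 255),
    'happy': (0, 255, 0),
    'sad': (0, 255, 255),
    'angry': (255, 0, 0),
    'fearful': (0, 0, 0),
    'disgusted': (255, 0, 255),
    'surprised': (255, 255, 0),
}

# ASSUMPTION: label order matching the model outputs; not given in the article.
CLASSES = ['neutral', 'happy', 'sad', 'angry', 'fearful', 'disgusted', 'surprised']


def color_for(prediction, classes=CLASSES, default='neutral'):
    """Return the RGB tuple for a predicted class index (hypothetical helper)."""
    if 0 <= prediction < len(classes):
        label = classes[prediction]
    else:
        label = default  # fall back instead of raising IndexError
    return LED_COLORS.get(label, LED_COLORS[default])
```

On the device, the result would be passed on as `leds.update(Leds.rgb_on(color_for(prediction)))`.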

, !

We use pyaudio to read the audio stream from the microphone and webrtcvad to detect voice activity in it.

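webrtcvad only accepts frames of 10, 20, or 30 ms. With a sampling rate of 48 kHz and a 20 ms frame, the chunk size is 48000×20/1000×1 (mono) = 960 samples. For each such chunk, webrtcvad returns True/False, which indicates whether the chunk contains speech.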
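The chunk-size arithmetic can be checked with a few lines of Python. This is a small sketch; the function names are illustrative, and the byte count assumes 16-bit PCM, which is what `pyaudio.paInt16` produces.

```python
# Sample rates and frame durations accepted by webrtcvad.
VALID_RATES = (8000, 16000, 32000, 48000)
VALID_FRAME_MS = (10, 20, 30)


def frame_samples(rate, frame_ms, channels=1):
    """Number of samples in one VAD frame: rate * ms / 1000 * channels."""
    if rate not in VALID_RATES:
        raise ValueError("unsupported sample rate: %d" % rate)
    if frame_ms not in VALID_FRAME_MS:
        raise ValueError("unsupported frame duration: %d ms" % frame_ms)
    return rate * frame_ms // 1000 * channels


def frame_bytes(rate, frame_ms, channels=1, sample_width=2):
    """Size of one frame in bytes for 16-bit PCM (2 bytes per sample)."""
    return frame_samples(rate, frame_ms, channels) * sample_width
```

With these settings, 30 consecutive non-speech chunks of 20 ms each correspond to 600 ms of silence.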

list , , , .
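If the number of consecutive non-speech chunks reaches 30 (600 ms), we look at the size of the list of accumulated chunks: if it is larger than 250, the fragment is added to the queue.
If the number of non-speech chunks is still below 30 but the list of accumulated chunks has grown past 300, the fragment is added to the queue as well.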

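import queue

import numpy as np
import pyaudio
import webrtcvad

# Recording loop flag; set it to False from another thread to stop recording.
process = True

def to_queue(frames):
    # Join the raw 16-bit PCM chunks into a single int16 array
    d = np.frombuffer(b''.join(frames), dtype=np.int16)
    return d

framesQueue = queue.Queue()
def framesThreadBody():
    CHUNK = 960  # 20 ms of mono audio at 48 kHz
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 48000

    p = pyaudio.PyAudio()
    vad = webrtcvad.Vad()
    vad.set_mode(2)
    stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)
    false_counter = 0  # consecutive chunks without speech
    audio_frame = []   # accumulated chunks that contain speech
    while process:
        data = stream.read(CHUNK)
        if not vad.is_speech(data, RATE):
            false_counter += 1
            # 30 silent chunks = 600 ms pause: flush if enough speech was collected
            if false_counter >= 30:
                if len(audio_frame) > 250:
                    framesQueue.put(to_queue(audio_frame))
                    audio_frame = []
                    false_counter = 0
        else:
            false_counter = 0
            audio_frame.append(data)
            # Long uninterrupted speech: flush once more than 300 chunks are collected
            if len(audio_frame) > 300:
                framesQueue.put(to_queue(audio_frame))
                audio_frame = []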
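The flushing rules in the recording loop can be tried out without a microphone by replaying a list of VAD decisions. The following is a sketch that mirrors the same thresholds (30 silent chunks, more than 250 or more than 300 voiced chunks); the function `segment_lengths` is illustrative and returns only the length of each fragment that would be queued.

```python
def segment_lengths(speech_flags, silence_chunks=30, min_chunks=250, max_chunks=300):
    """Replay VAD decisions and return the size of every queued fragment."""
    segments = []
    voiced = 0       # chunks accumulated in the current fragment
    silent_run = 0   # consecutive chunks without speech
    for is_speech in speech_flags:
        if is_speech:
            silent_run = 0
            voiced += 1
            if voiced > max_chunks:
                # Speech keeps going: cut the fragment here.
                segments.append(voiced)
                voiced = 0
        else:
            silent_run += 1
            if silent_run >= silence_chunks and voiced > min_chunks:
                # A long enough pause after a long enough fragment.
                segments.append(voiced)
                voiced = 0
                silent_run = 0
    return segments
```

For example, 260 voiced chunks followed by 30 silent ones yield one fragment of 260 chunks, while 100 voiced chunks followed by any amount of silence yield nothing.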

, github, , , . , , , OpenVINO — IR (Intermediate Representation). 5-7 github, , — .

— https://github.com/alexmuhr/Voice_Emotion
: , MFCC CNN
— https://github.com/linhdvu14/vggvox-speaker-identification
MFCC , FFT CNN, .

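OpenVINO includes several components:

Open Model Zoo, a collection of pretrained models,
Model Optimizer, which converts models from various framework formats (TensorFlow, ONNX, etc.) into Intermediate Representation,
Inference Engine, which runs IR models on Intel hardware, including the Myriad chip in the Neural Compute Stick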

OpenCV ( Inference Engine)
A model in IR format is described by two files: .xml and .bin.
Models are converted to IR with Model Optimizer as follows:

python /opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py --input_model speaker.hdf5.pb --data_type=FP16 --input_shape [1,512,1000,1]

--data_type , . FP32, FP16, INT8. .
--input_shape . C++ API, .
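When several models have to be converted, the same command can be assembled from Python. This is a sketch under assumptions: `build_mo_command` is a hypothetical helper, and only the flags and the script path shown in the command above are used. Passing the arguments as a list to `subprocess.run` also keeps the shell from interpreting the square brackets in the shape.

```python
import subprocess

# Script path as used in the article's command.
MO_SCRIPT = "/opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py"


def build_mo_command(input_model, input_shape, data_type="FP16", mo_script=MO_SCRIPT):
    """Build the Model Optimizer argument list (hypothetical helper)."""
    shape = "[" + ",".join(str(dim) for dim in input_shape) + "]"
    return [
        "python", mo_script,
        "--input_model", input_model,
        "--data_type=" + data_type,
        "--input_shape", shape,
    ]


def convert(input_model, input_shape, data_type="FP16"):
    """Run the conversion; raises CalledProcessError if Model Optimizer fails."""
    subprocess.run(build_mo_command(input_model, input_shape, data_type), check=True)
```

Calling `convert("speaker.hdf5.pb", (1, 512, 1000, 1))` would run the same command as above.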
Next, load the model in IR format into OpenCV's DNN module and run a forward pass on it.

import cv2 as cv
emotionsNet = cv.dnn.readNet('emotions_model.bin',
                          'emotions_model.xml')
emotionsNet.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

Neural Compute Stick, , Raspberry Pi , .

: ( 0.4), MFCC, :

emotionsNet.setInput(MFCC_from_window)
result = emotionsNet.forward()

. , - , . , — . , . , .

, ( , , ).

python3 voice_db/record_voice.py test.wav

( )
Each recording is passed through a fast Fourier transform, and the resulting spectrum is saved as a numpy array (.npy):

for file in glob.glob("voice_db/*.wav"):
        spec = get_fft_spectrum(file)
        np.save(file[:-4] + '.npy', spec)
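The `get_fft_spectrum` function belongs to the project and is not shown in the article. To illustrate what such a step does, here is a stand-in sketch of a magnitude spectrogram in plain NumPy. The function name `magnitude_spectrogram` and the frame length, hop, and FFT size are assumptions chosen for illustration, not the project's parameters.

```python
import numpy as np


def magnitude_spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Return |FFT| of overlapping Hamming-windowed frames, shape (n_fft//2+1, n_frames)."""
    signal = np.asarray(signal, dtype=np.float32)
    if len(signal) < frame_len:
        # Pad very short signals so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real signal.
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return spectrum.T
```

For one second of audio at 16 kHz this yields a 257-row array with one column per frame, which can be saved with `np.save` in the same way as above.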

create_base.py
:

for file in glob.glob("voice_db/*.npy"):
    spec = np.load(file)
    spec = spec.astype('float32')
    spec_reshaped = spec.reshape(1, 1, spec.shape[0], spec.shape[1])
    srNet.setInput(spec_reshaped)
    pred = srNet.forward()
    emb = np.squeeze(pred)

, , cosine distance ( , ) — 0.3):

        dist_list = cdist(emb, enroll_embs, metric="cosine")
        distances = pd.DataFrame(dist_list, columns = df.speaker)
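The comparison step can be illustrated in plain NumPy. This is a sketch, not the project's code: `cosine_distances` and `identify` are hypothetical helpers, and treating 0.3 as the acceptance threshold is an assumption based on the value mentioned in the text.

```python
import numpy as np


def cosine_distances(emb, enroll_embs):
    """Cosine distance (1 - cosine similarity) from one embedding to each enrolled one."""
    emb = np.asarray(emb, dtype=np.float64).reshape(1, -1)
    enroll = np.asarray(enroll_embs, dtype=np.float64)
    dots = (enroll @ emb.T).ravel()
    norms = np.linalg.norm(enroll, axis=1) * np.linalg.norm(emb)
    return 1.0 - dots / norms


def identify(emb, enroll_embs, speakers, threshold=0.3):
    """Return (speaker, distance); speaker is None if nobody is close enough.

    ASSUMPTION: 0.3 is used as the maximum accepted cosine distance.
    """
    distances = cosine_distances(emb, enroll_embs)
    best = int(np.argmin(distances))
    if distances[best] <= threshold:
        return speakers[best], float(distances[best])
    return None, float(distances[best])
```

An embedding that is far from every enrolled voice gives `None`, which lets the application treat the speaker as unknown instead of picking the nearest wrong match.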

, 1-2 ( 7 2.5). -.

-

: , .

Raspberry Pi, websocket (http over tcp protocol).

, json , , . , . golang, , , .
, . , hub, ( ), ( ), , hub.
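The hub idea, where clients register and unregister and every message is broadcast to all registered clients, can be sketched in a few lines. The project's server is written in Go; the Python version below only illustrates the pattern. The class `Hub` and the JSON field names (`speaker`, `emotion`, `probability`) are hypothetical.

```python
import json
import queue
import threading


class Hub:
    """Keeps a set of client queues and broadcasts every message to all of them."""

    def __init__(self):
        self._clients = set()
        self._lock = threading.Lock()

    def register(self):
        """Add a new client and return the queue it should read from."""
        client = queue.Queue()
        with self._lock:
            self._clients.add(client)
        return client

    def unregister(self, client):
        """Remove a client; messages sent later will not reach it."""
        with self._lock:
            self._clients.discard(client)

    def broadcast(self, message):
        """Serialize the message to JSON once and deliver it to every client."""
        payload = json.dumps(message)
        with self._lock:
            clients = list(self._clients)
        for client in clients:
            client.put(payload)
```

In a real server each queue would be drained by the goroutine or thread that writes to one WebSocket connection, so a slow client does not block the others.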

Front-end web-, JavaScript React . , , back-end Raspberry Pi. , react-router, , WebSocket. Raspberry Pi , probability . , , , , .

, , , , . , , , . — , . , , , , .

, 150$:

Raspberry Pi 3 ~ 35$
Google AIY Voice Bonnet ( respeaker) ~ 15$
Intel NCS 2 ~ 100$

— ,
:
()

: https://github.com/vladimirwest/OpenEMO

. . . , , AI .
