Video calls with virtual background and open source tools

Now that many of us are quarantined due to COVID-19, video calls have become a much more frequent occurrence than before. In particular, the Zoom service has suddenly become very popular. Probably the most interesting Zoom feature is its support for a Virtual Background. It allows users to interactively replace the background behind them with any image or video.



I have been using Zoom at work for a long time, usually from a corporate laptop, for open-source meetings on Kubernetes. Now that I'm working from home, I am inclined to use my more powerful and convenient personal desktop computer for some of my open source tasks.

Unfortunately, Zoom only supports a background removal method known as "chroma key" or "green screen". To use this method, the background must be a solid color, ideally green, and must be uniformly lit.

Since I do not have a green screen, I decided to simply implement my own background removal system. And this, of course, is much better than tidying up the apartment or constantly using the work laptop.

As it turned out, by using ready-made open source components and writing just a few lines of my own code, I was able to get very decent results.

Reading camera data


Let's start from the beginning and answer the following question: "How to get video from a webcam that we will process?"

Since I use Linux on my home computer (when I'm not playing games), I decided to use the OpenCV Python bindings, which I'm already familiar with. In addition to V4L2 bindings for reading data from a webcam, they include useful basic video processing functions. Reading a frame from a webcam in python-opencv is very simple:



import cv2
cap = cv2.VideoCapture('/dev/video0')
success, frame = cap.read()

To improve the results when working with my camera, I applied the following settings before capturing video from it:

# configure the camera for 720p @ 60 FPS
height, width = 720, 1280
cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
cap.set(cv2.CAP_PROP_FPS, 60)

It seems that most video conferencing programs limit video to 720p @ 30 FPS or lower, but we won't necessarily read every frame anyway; these settings just set an upper limit.
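The camera is also free to ignore requested settings and fall back to a mode it actually supports, so a quick sanity check (a small sketch, not part of the final script) is to read the properties back:

# the camera may silently fall back to a supported mode,
# so read back the values it actually accepted
actual_width = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
actual_height = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
actual_fps = cap.get(cv2.CAP_PROP_FPS)
print(f"capturing at {int(actual_width)}x{int(actual_height)} @ {actual_fps} FPS")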

Put the frame capture mechanism in a loop. Now we have access to the video stream from the camera!

while True:
    success, frame = cap.read()

You can save the frame for test purposes as follows:

cv2.imwrite("test.jpg", frame)

After that, we can make sure that the camera is working. Great!


I hope you don't mind my beard

Background detection


Now that we have access to the video stream, let's think about how to detect the background so that we can replace it. And this is already a rather difficult task.

Although the creators of Zoom don't seem to have said anywhere exactly how the program removes the background, the way the system behaves makes me suspect that a neural network is involved. It's hard to explain, but the results look exactly like that. In addition, I found an article about how Microsoft Teams implements background blur using a convolutional neural network.

In principle, creating your own neural network is not that difficult. There are many articles and scientific papers on image segmentation, as well as plenty of open source libraries and tools. But we need a very specialized dataset to get good results.

In particular, we need a lot of images resembling those obtained from a webcam, with a person in the foreground, and with each pixel of every image labeled as either person or background.

Fortunately, building such a dataset and training a neural network ourselves is not necessary. A team of researchers at Google has already done the hardest part and released a pre-trained neural network for person segmentation as open source. This network is called BodyPix. It works very well!

BodyPix is currently only available in a form suitable for TensorFlow.js. As a result, it is easiest to use via the body-pix-node library.

To speed up inference in the browser, the WebGL backend is preferable, but in a Node.js environment you can use the TensorFlow GPU backend (note that this requires an NVIDIA video card, which I have).

To simplify the project setup, we will use a small containerized environment that provides TensorFlow GPU and Node.js. Using all of this with nvidia-docker is much easier than collecting the necessary dependencies on your computer yourself. You only need Docker and up-to-date graphics drivers on your machine.

Here are the contents of the file bodypix/package.json:

{
    "name": "bodypix",
    "version": "0.0.1",
    "dependencies": {
        "@tensorflow-models/body-pix": "^2.0.5",
        "@tensorflow/tfjs-node-gpu": "^1.7.1"
    }
}

Here is the file bodypix/Dockerfile:

# base image with TensorFlow GPU support
FROM nvcr.io/nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04
# install node
RUN apt update && apt install -y curl make build-essential \
    && curl -sL https://deb.nodesource.com/setup_12.x | bash - \
    && apt-get -y install nodejs \
    && mkdir /.npm \
    && chmod 777 /.npm
# this setting is needed, because otherwise
# tfjs-node-gpu grabs all of the GPU memory :(
ENV TF_FORCE_GPU_ALLOW_GROWTH=true
# install the node dependencies
WORKDIR /src
COPY package.json /src/
RUN npm install
# copy in the application code
COPY app.js /src/
ENTRYPOINT node /src/app.js

Now let's talk about getting results. But I warn you right away: I am not a Node.js expert! This is just the result of my evening experiments, so bear with me :-).

The following simple script accepts an image via an HTTP POST request and responds with a binary mask: a two-dimensional array of pixels in which zeros represent the background.

Here is the code of the file app.js:

const tf = require('@tensorflow/tfjs-node-gpu');
const bodyPix = require('@tensorflow-models/body-pix');
const http = require('http');
(async () => {
    const net = await bodyPix.load({
        architecture: 'MobileNetV1',
        outputStride: 16,
        multiplier: 0.75,
        quantBytes: 2,
    });
    const server = http.createServer();
    server.on('request', async (req, res) => {
        var chunks = [];
        req.on('data', (chunk) => {
            chunks.push(chunk);
        });
        req.on('end', async () => {
            const image = tf.node.decodeImage(Buffer.concat(chunks));
            const segmentation = await net.segmentPerson(image, {
                flipHorizontal: false,
                internalResolution: 'medium',
                segmentationThreshold: 0.7,
            });
            res.writeHead(200, { 'Content-Type': 'application/octet-stream' });
            res.write(Buffer.from(segmentation.data));
            res.end();
            tf.dispose(image);
        });
    });
    server.listen(9000);
})();

To convert a frame to a mask, we can use the numpy and requests packages in a Python script:

def get_mask(frame, bodypix_url='http://localhost:9000'):
    _, data = cv2.imencode(".jpg", frame)
    r = requests.post(
        url=bodypix_url,
        data=data.tobytes(),
        headers={'Content-Type': 'application/octet-stream'})
    # decode the raw response into a flat numpy array
    # of uint8[width * height] values, 0 or 1
    mask = np.frombuffer(r.content, dtype=np.uint8)
    mask = mask.reshape((frame.shape[0], frame.shape[1]))
    return mask
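Since the mask only contains zeros and ones, it will look almost entirely black if saved directly. A quick way to inspect it (just a debugging sketch, not part of the final script) is to scale the values up before writing it out:

# scale 0/1 up to 0/255 so the mask is visible as a black-and-white image
cv2.imwrite("mask_test.jpg", mask * 255)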

The result is approximately the following.


Mask

While I was working on all this, I came across the following tweet.


This is definitely the best background for video calls.

Now that we have a mask to separate the foreground from the background, replacing the background with something else will be very simple.

I took the background image from that tweet thread and cropped it to get a 16:9 picture.


Background image

After that I did the following:

# load the replacement background (in this case it has a 16:9 aspect ratio)
replacement_bg_raw = cv2.imread("background.jpg")

# resize it to match the frame (width & height come from the capture settings above)
height, width = 720, 1280
replacement_bg = cv2.resize(replacement_bg_raw, (width, height))

# combine the foreground and background using the mask and its inverse
inv_mask = 1-mask
for c in range(frame.shape[2]):
    frame[:,:,c] = frame[:,:,c]*mask + replacement_bg[:,:,c]*inv_mask

Here is what I got:


The result of replacing the background.

Such a mask is obviously not accurate enough; the reason for this is the performance trade-offs we made when configuring BodyPix. Overall, though, everything looks more or less tolerable.

But, when I looked at this background, one idea came to me.

Interesting experiments


Now that we've figured out how to build the mask, let's think about how to improve the result.

The first obvious step is to soften the edges of the mask. For example, this can be done like this:

def post_process_mask(mask):
    # dilate and then erode (morphological closing) to fill small holes
    # in the mask and smooth its outline
    mask = cv2.dilate(mask, np.ones((10,10), np.uint8), iterations=1)
    mask = cv2.erode(mask, np.ones((10,10), np.uint8), iterations=1)
    return mask

This improves the situation a bit, but it's not a huge step forward. And a simple background replacement is quite boring. But since we've built all of this ourselves, we can do anything with the picture, not just remove the background.

Given that we are using a Star Wars virtual background, I decided to create a hologram effect to make the picture more interesting. This also helps to hide the blurriness of the mask.

First, update the post-processing code:

def post_process_mask(mask):
    mask = cv2.dilate(mask, np.ones((10,10), np.uint8), iterations=1)
    # blur instead of eroding back, so the mask fades out smoothly at the edges
    mask = cv2.blur(mask.astype(float), (30,30))
    return mask
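A side effect worth noting: after the blur, the mask values are no longer strictly 0 or 1 but floats in between, so the compositing formula we used earlier effectively turns into per-pixel alpha blending. A small illustrative sketch, using a hypothetical composite() helper with the same formula as the replacement code above:

# with a float mask, each output pixel is a weighted mix of foreground and
# background; pixels near the blurred edge fade smoothly between the two
def composite(frame, background, mask):
    out = frame.copy()
    inv_mask = 1 - mask
    for c in range(frame.shape[2]):
        out[:,:,c] = frame[:,:,c]*mask + background[:,:,c]*inv_mask
    return out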

The edges are now blurry. This is good, but we still need to create a hologram effect.

Hollywood holograms usually have the following properties:

  • A pale or monochrome color, as if drawn with a bright laser.
  • Scan lines or something like a grid, as if the image is being projected in several beams.
  • A "ghosting" effect, as if the projection were rendered in layers, or as if the correct projection distance hadn't quite been maintained.

All these effects can be implemented step by step.

First, to tint the image with a shade of blue, we can use the applyColorMap method:

# add a blue tint to the frame
holo = cv2.applyColorMap(frame, cv2.COLORMAP_WINTER)

Next, we add scan lines with a halftone-like effect:

# darken bands of bandLength rows to 10-30% of their brightness,
# leaving bandGap rows untouched between the bands
bandLength, bandGap = 2, 3
for y in range(holo.shape[0]):
    if y % (bandLength+bandGap) < bandLength:
        holo[y,:,:] = holo[y,:,:] * np.random.uniform(0.1, 0.3)

Next, we implement the “ghost effect” by adding shifted weighted copies of the current effect to the image:

# shift_img : https://stackoverflow.com/a/53140617
def shift_img(img, dx, dy):
    img = np.roll(img, dy, axis=0)
    img = np.roll(img, dx, axis=1)
    if dy>0:
        img[:dy, :] = 0
    elif dy<0:
        img[dy:, :] = 0
    if dx>0:
        img[:, :dx] = 0
    elif dx<0:
        img[:, dx:] = 0
    return img

# the ghosting effect: holo * 0.2 + shifted_holo * 0.8 + 0
holo2 = cv2.addWeighted(holo, 0.2, shift_img(holo.copy(), 5, 5), 0.8, 0)
holo2 = cv2.addWeighted(holo2, 0.4, shift_img(holo.copy(), -5, -5), 0.6, 0)

And finally, we want to keep some of the original colors, so we combine the hologram effect with the original frame, much as we added the "ghost effect":

holo_done = cv2.addWeighted(frame, 0.5, holo2, 0.6, 0)

Here's what a frame with a hologram effect looks like:


A frame with a hologram effect

This frame itself looks pretty good.

Now let's try to combine it with the background.


Image overlaid on the background.

Done! (I promise, it looks even better in motion.)

Video output


And now I must admit that we've missed something: we still can't actually use any of this for making video calls.

In order to fix this, we will use pyfakewebcam and v4l2loopback to create a dummy webcam.

In addition, we plan to drive this camera from inside Docker.

First, create a fakecam/requirements.txt file describing the dependencies:

numpy==1.18.2
opencv-python==4.2.0.32
requests==2.23.0
pyfakewebcam==0.1.0

Now create a fakecam/Dockerfile for the application that implements the dummy camera:

FROM python:3-buster
# ensure pip is up to date
RUN pip install --upgrade pip
# install the system dependencies required by opencv
RUN apt-get update && \
    apt-get install -y \
      `# opencv requirements` \
      libsm6 libxext6 libxrender-dev \
      `# opencv video opening requirements` \
      libv4l-dev
# install the python dependencies from requirements.txt
WORKDIR /src
COPY requirements.txt /src/
RUN pip install --no-cache-dir -r /src/requirements.txt
# copy in the virtual background image
COPY background.jpg /data/
# copy in the application code (the full script is shown below)
COPY fake.py /src/
ENTRYPOINT python -u fake.py

Now, from the command line, install v4l2loopback:

sudo apt install v4l2loopback-dkms

Set up a dummy camera:

sudo modprobe -r v4l2loopback
sudo modprobe v4l2loopback devices=1 video_nr=20 card_label="v4l2loopback" exclusive_caps=1

The exclusive_caps setting is needed for some applications (Chrome, Zoom) to work. The card_label is set only to make it easier to pick the camera in applications. Specifying video_nr=20 creates the device /dev/video20, provided that number is not already in use, which is unlikely.
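Before wiring anything up, it may be worth checking that the loopback device actually appeared. A minimal Python sketch (just a sanity check, not part of the final script):

import glob
import os

# list all video devices and make sure the loopback device exists
print(sorted(glob.glob('/dev/video*')))
assert os.path.exists('/dev/video20'), "v4l2loopback device /dev/video20 not found"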

Now we’ll make changes to the script to create a dummy camera:

# the fake camera, again using the width and height configured earlier
fake = pyfakewebcam.FakeWebcam('/dev/video20', width, height)

It should be noted that pyfakewebcam expects images with channels in RGB order (red, green, blue), while OpenCV works with channels in BGR order (blue, green, red).

You can fix this before outputting the frame, and then send the frame like this:

frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
fake.schedule_frame(frame)

Here is the full code of fakecam/fake.py:

import os
import cv2
import numpy as np
import requests
import pyfakewebcam

def get_mask(frame, bodypix_url='http://localhost:9000'):
    _, data = cv2.imencode(".jpg", frame)
    r = requests.post(
        url=bodypix_url,
        data=data.tobytes(),
        headers={'Content-Type': 'application/octet-stream'})
    mask = np.frombuffer(r.content, dtype=np.uint8)
    mask = mask.reshape((frame.shape[0], frame.shape[1]))
    return mask

def post_process_mask(mask):
    mask = cv2.dilate(mask, np.ones((10,10), np.uint8) , iterations=1)
    mask = cv2.blur(mask.astype(float), (30,30))
    return mask

def shift_image(img, dx, dy):
    img = np.roll(img, dy, axis=0)
    img = np.roll(img, dx, axis=1)
    if dy>0:
        img[:dy, :] = 0
    elif dy<0:
        img[dy:, :] = 0
    if dx>0:
        img[:, :dx] = 0
    elif dx<0:
        img[:, dx:] = 0
    return img

def hologram_effect(img):
    # add a blue tint
    holo = cv2.applyColorMap(img, cv2.COLORMAP_WINTER)
    # add a halftone / scan line effect
    bandLength, bandGap = 2, 3
    for y in range(holo.shape[0]):
        if y % (bandLength+bandGap) < bandLength:
            holo[y,:,:] = holo[y,:,:] * np.random.uniform(0.1, 0.3)
    # add some ghosting
    holo_blur = cv2.addWeighted(holo, 0.2, shift_image(holo.copy(), 5, 5), 0.8, 0)
    holo_blur = cv2.addWeighted(holo_blur, 0.4, shift_image(holo.copy(), -5, -5), 0.6, 0)
    # combine the hologram with some of the original color
    out = cv2.addWeighted(img, 0.5, holo_blur, 0.6, 0)
    return out

def get_frame(cap, background_scaled):
    _, frame = cap.read()
    # fetch the mask with retries (the bodypix app needs some time
    # to warm up, so the first requests may fail)
    mask = None
    while mask is None:
        try:
            mask = get_mask(frame)
        except requests.RequestException:
            print("mask request failed, retrying")
    # post-process the mask and apply the hologram effect to the frame
    mask = post_process_mask(mask)
    frame = hologram_effect(frame)
    # composite the foreground onto the virtual background using the mask
    inv_mask = 1-mask
    for c in range(frame.shape[2]):
        frame[:,:,c] = frame[:,:,c]*mask + background_scaled[:,:,c]*inv_mask
    return frame

# set up access to the real webcam
cap = cv2.VideoCapture('/dev/video0')
height, width = 720, 1280
cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
cap.set(cv2.CAP_PROP_FPS, 60)

# set up the fake camera
fake = pyfakewebcam.FakeWebcam('/dev/video20', width, height)

# load and resize the virtual background
background = cv2.imread("/data/background.jpg")
background_scaled = cv2.resize(background, (width, height))

# run the processing loop forever
while True:
    frame = get_frame(cap, background_scaled)
    # the fake camera expects RGB, so convert the frame before sending it
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    fake.schedule_frame(frame)

Now build the images:

docker build -t bodypix ./bodypix
docker build -t fakecam ./fakecam

Run them:

# create a shared network for the two containers
docker network create --driver bridge fakecam
# run the bodypix app
docker run -d \
  --name=bodypix \
  --network=fakecam \
  --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  bodypix
# run the fakecam app; we need to pass the video devices through to the container
# and run it as a user that belongs to the video group
# note: you may first need to add yourself to that group, e.g. `sudo usermod -aG video $USER`
docker run -d \
  --name=fakecam \
  --network=fakecam \
  -p 8080:8080 \
  -u "$$(id -u):$$(getent group video | cut -d: -f3)" \
  $$(find /dev -name 'video*' -printf "--device %p ") \
  fakecam

Keep in mind that all of this must be started before you open the camera in any application. In Zoom or elsewhere, you then need to select the v4l2loopback camera (/dev/video20).

Summary


Here is a clip that demonstrates the results of my work.


Background change result

See! I'm calling from the Millennium Falcon, using an open source camera stack!

I really like how this turned out, and I will definitely use it at my next video conference.

Dear readers! Are you planning to replace what's visible behind you during video calls with something else?

