The power of PWA: A video surveillance system with a 300-line neural network JS code

Hello, Habr!

Web browsers are slowly but surely implementing most operating-system features, and there is less and less reason to develop a native application when you can write a web version (PWA) instead. Cross-platform support, rich APIs, fast development in TS/JS, and even the performance of the V8 engine are all pluses. Browsers have long been able to work with a video stream and run neural networks, which means we have all the components for building a video surveillance system with object recognition. Inspired by this article, I decided to bring the demo to the level of a practical application, which I want to share.

The application records video from the camera, periodically sending frames to the COCO-SSD model for recognition, and if a person is detected, video fragments in 7-second portions are sent to a specified email address via the Gmail API. As in grown-up systems, pre-recording is performed: we save one fragment before the moment of detection, all fragments with detection, and one after. If the Internet is unavailable, or an error occurs while sending, the videos are saved to the local Downloads folder. Using email makes it possible to do without a server side and to notify the owner instantly, and even if an attacker takes possession of the device and cracks all the passwords, they will not be able to delete the mail from the recipient. The downsides are traffic overhead due to Base64 (although it is tolerable for one camera) and the need to assemble the final video file from many emails.

The working demo is here.

The problems encountered are as follows:

1) The neural network heavily loads the processor, and if you run it in the main thread, the video starts to lag. Recognition is therefore moved to a separate thread (a worker), although not everything is smooth here either. On a prehistoric dual-core Linux machine everything runs perfectly in parallel, but on some fairly new 4-core mobile phones, at the moment of recognition (inside the worker) the main thread also starts to lag, which is noticeable in the user interface. Fortunately, this does not affect the quality of the video, although it reduces the recognition frequency (which automatically adjusts to the load). The problem is probably related to how different versions of Android distribute threads across cores, the presence of SIMD, available GPU features, and so on. I cannot figure it out on my own, as I do not know the internals of TensorFlow, and I would be grateful for any information.

2) Firefox. The application works fine under Chrome / Chromium / Edge; however, recognition in Firefox is noticeably slower, and in addition ImageCapture is still not implemented (of course, this can be worked around by capturing a frame from <video>, but it is a shame for the fox, since it is a standard API). In short, full cross-browser support did not happen this time either.

So, let's take everything in order.

Getting a camera and microphone


this.video = this.querySelector('video')
this.canvas = this.querySelectorAll('canvas')[0]  // service canvas; the second canvas (this.bbox) is the detection-frame overlay

this.stream = await navigator.mediaDevices.getUserMedia(
   {video: {facingMode: {ideal: "environment"}}, audio: true}
)
this.video.srcObject = this.stream
await new Promise((resolve) => {
   this.video.onloadedmetadata = () => resolve()
})
this.W = this.bbox.width = this.canvas.width = this.video.videoWidth
this.H = this.bbox.height = this.canvas.height = this.video.videoHeight

Here we select the main camera of a mobile phone / tablet (or the first camera on a computer / laptop), display the stream in a standard video player, then wait for the metadata to load and set the dimensions of the service canvas. Since the entire application is written in async/await style, the many callback-based APIs have to be converted to Promises for uniformity.
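Since callback-style APIs keep turning up (onloadedmetadata here, ondataavailable and request.execute() later), the conversion can be factored into a tiny generic helper. This once() function is a sketch of my own, not from the article's source:

```javascript
// Hypothetical helper (not in the article's code): resolve a Promise
// the first time an `on<event>` callback-style property fires.
function once(target, event) {
  return new Promise((resolve) => {
    target['on' + event] = (ev) => {
      target['on' + event] = null  // detach so it fires only once
      resolve(ev)
    }
  })
}

// Usage, replacing the manual Promise around onloadedmetadata:
// await once(this.video, 'loadedmetadata')
```

It only covers the `onxxx`-property style of callback API; addEventListener-based and `execute(cb)`-based APIs still need their own small wrappers.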

Video capture


There are two ways to capture video. The first is to read frames directly from the incoming stream, draw them on a canvas, modify them (for example, add geo and timestamps), and then take the data from the canvas: as an outgoing stream for the recorder, and as separate images for the neural network. In this case, you can do without the <video> element.

this.capture = new ImageCapture(this.stream.getVideoTracks()[0])
this.recorder = new MediaRecorder(this.canvas.captureStream(), {mimeType: "video/webm"})
this.ctx = this.canvas.getContext('2d')  // drawImage/getImageData live on the 2D context, not on the canvas itself

grab_video()

async function grab_video() {
   this.ctx.drawImage(await this.capture.grabFrame(), 0, 0)
   const img = this.ctx.getImageData(0, 0, this.W, this.H)
   ... // send img to the neural network worker
   ... // draw geo and timestamps onto the canvas
   window.requestAnimationFrame(this.grab_video.bind(this))
}

The second way (which works in FF) is to capture from the standard video player. Incidentally, it consumes less processor time than frame-by-frame drawing on the canvas, but we cannot add an overlay caption to the recording.

...
async function grab_video() {
	this.canvas.getContext('2d').drawImage(this.video, 0, 0)
	...
}

The application uses the first option, which means the video player can be turned off during recognition. To save processor time, recording is carried out from the incoming stream, and drawing frames on the canvas is used only to obtain the pixel array for the neural network, at a frequency that depends on the recognition speed. The frame around a detected person is drawn on a separate canvas placed over the player.

Neural network loading and human detection


It is all indecently simple. We start the worker, and after the model loads (which takes quite a while) it sends an empty message to the main thread, where the onmessage handler shows the start button; from then on the worker is ready to receive images. The full worker code:

(async () => {
  self.importScripts('https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js')
  self.importScripts('https://cdn.jsdelivr.net/npm/@tensorflow-models/coco-ssd')

  let model = await cocoSsd.load()
  self.postMessage({})

  self.onmessage = async (ev) => {
    const result = await model.detect(ev.data)
    const person = result.find(v => v.class === 'person')
    if (person) 
      self.postMessage({ok: true, bbox: person.bbox})
    else
      self.postMessage({ok: false, bbox: null})
  }
})()

In the main thread, the grab_video() function is called again only after the previous result arrives from the worker, so the detection frequency automatically depends on the system load.
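This "one frame in flight" handshake can be sketched as a small gate around the worker. The names below (makeDetectLoop, grabFrame) are illustrative, not from the article's source:

```javascript
// Sketch of the main-thread handshake: the next frame is submitted only
// after the worker has answered, so under load the detection rate degrades
// gracefully instead of queueing stale frames in the worker's mailbox.
function makeDetectLoop(worker, grabFrame, onResult) {
  let busy = false
  worker.onmessage = (ev) => {
    busy = false
    onResult(ev.data)
  }
  return () => {            // call this once per animation frame
    if (busy) return false  // previous detection still running: skip this frame
    busy = true
    worker.postMessage(grabFrame())
    return true
  }
}
```

In the real application grabFrame() would return the ImageData from the canvas and onResult() would draw or clear the bounding box.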

Video recording


this.recorder.rec = new MediaRecorder(this.stream, {mimeType: "video/webm"})
this.recorder.rec.ondataavailable = (ev) => {
   this.chunk = ev.data
   if (this.detected) {
      this.send_chunk()            // person in frame: always send
   } else if (this.recorder.num > 0) {
      this.send_chunk()            // post-recording: send the remaining fragments
      this.recorder.num--
   }
}
...
this.recorder.rec.start()
this.recorder.num = 0
this.recorder.interval = setInterval(() => {
   this.recorder.rec.stop()       // stop() fires ondataavailable with the fragment
   this.recorder.rec.start()
}, CHUNK_DURATION)

Each time the recorder is stopped (at a fixed interval), the ondataavailable event fires, passing the recorded fragment as a Blob, which is saved in this.chunk and sent asynchronously. Yes, this.send_chunk() returns a promise, but the function takes a long time (Base64 encoding, sending an email or saving the file locally), and we neither wait for it to finish nor process its result, so there is no await. Even if new video fragments appear faster than they can be sent, the JS engine queues the promises transparently for the developer, and all the data will be sent or written sooner or later. The only thing worth paying attention to: inside send_chunk(), before the first await, you need to clone the Blob with the slice() method, since the this.chunk reference is overwritten on every CHUNK_DURATION tick.
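If you want to be strict about ordering, the fire-and-forget sends can also be chained explicitly, so fragments always go out in recording order even when an earlier send happens to be slower. A minimal sketch with hypothetical names (makeSerialSender is not in the article's source):

```javascript
// Hypothetical serializer: each call is chained onto the previous one, so
// chunks are sent strictly in recording order while the caller never awaits.
// Errors are swallowed per-chunk so one failed send does not stall the queue.
function makeSerialSender(sendFn) {
  let tail = Promise.resolve()
  return (chunk) => {
    // In the real app you would clone here (chunk.slice() for a Blob) before
    // the first await, because the caller's reference is overwritten each tick.
    tail = tail.then(() => sendFn(chunk)).catch(() => {})
    return tail
  }
}
```

With this, ondataavailable would simply call the returned function instead of send_chunk() directly.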

Gmail API


It is used to send the letters. The API is quite old, partly promise-based, partly callback-based, and documentation and examples are scarce, so I will give the full code.

Authorization. We get the application and client keys in the Google developer console. In the pop-up authorization window, Google reports that the application has not been verified, and you have to click "advanced settings" to enter. Getting the application verified by Google turned out to be a non-trivial task: you need to confirm ownership of a domain (which I do not have) and properly set up the main page, so I decided not to bother.

await import('https://apis.google.com/js/api.js')
gapi.load('client:auth2', async () => {
   try {
      await gapi.client.init({
         apiKey: API_KEY,
         clientId: CLIENT_ID,
         discoveryDocs: ['https://www.googleapis.com/discovery/v1/apis/gmail/v1/rest'],
         scope: 'https://www.googleapis.com/auth/gmail.send'
      }) 
      if (!gapi.auth2.getAuthInstance().isSignedIn.get()) { // isSignedIn is a LiveValue: read it via get(), not a minified internal like .je
         await gapi.auth2.getAuthInstance().signIn()
      }
      this.msg.innerHTML = ''
      this.querySelector('nav').style.display = ''
   } catch(e) {
      this.msg.innerHTML = 'Gmail authorization error: ' + JSON.stringify(e, null, 2)
   }
})

Sending the email. Base64-encoded strings cannot simply be concatenated, which is inconvenient. I still have not figured out how to send video in its original binary format. In the last lines, we convert the callback into a promise; unfortunately, this has to be done quite often.

async send_mail(subject, mime_type, body) {
   const headers = {
      'From': '',
      'To': this.email,
      'Subject': 'Balajahe CCTV: ' + subject,
      'Content-Type': mime_type,
      'Content-transfer-encoding': 'base64'
   }
   let head = ''
   for (const [k, v] of Object.entries(headers)) head += k + ': ' + v + '\r\n'
   const request = gapi.client.gmail.users.messages.send({
      'userId': 'me',
      'resource': { 'raw': btoa(head + '\r\n' + body) }
   })
   return new Promise((resolve, reject) => {
      request.execute((res) => {
         if (!res.code) 
            resolve() 
         else 
            reject(res)
      })
   })
}
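One detail worth hedging: the Gmail API documents the raw field as URL-safe Base64 (RFC 4648 §5), while btoa() produces the standard alphabet, so messages whose encoding contains '+' or '/' may be rejected. A small conversion helper fixes this; the names below are mine, not from the source:

```javascript
// Convert standard Base64 (as produced by btoa) to the URL-safe variant
// that the Gmail API expects in the `raw` field: '+' -> '-', '/' -> '_',
// padding stripped.
function toBase64Url(base64) {
  return base64.replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '')
}

// Build the raw RFC 2822 message the same way send_mail() does, then
// convert it. (btoa is a browser global; Node 16+ also provides it.)
function buildRaw(head, body) {
  return toBase64Url(btoa(head + '\r\n' + body))
}
```

In send_mail(), the 'raw' value would then become buildRaw(head, body) instead of the plain btoa() call.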

Saving a video fragment to disk. We use a hidden hyperlink:

const a = this.querySelector('a')
URL.revokeObjectURL(a.href)
a.href = URL.createObjectURL(chunk)
a.download = name
a.click()

State management in the world of web components


Continuing the idea presented in this article, I took it to its logical (absurd) conclusion (for the lulz only) and turned state management upside down. Usually JS variables are considered the state, and the DOM is only its current display; in my case the source of truth is the DOM itself (since web components are long-lived DOM nodes), and to use the data on the JS side, the web components provide getters/setters for each form field. So, for example, instead of checkboxes, which are awkward to style, plain <button> elements are used, and the button's "value" (true = pressed, false = released) lives in the class attribute, which allows you to style it like this:

button.true {background-color: red}

and get the value like this:

get detecting() { return this.querySelector('#detecting').className === 'true' }

I cannot recommend using this in production, because it is a good way to kill performance. Although... the virtual DOM is not free either, and I did not run benchmarks.

Offline mode


Finally, let's add a little PWA: install a service worker that caches all network requests and lets the application work without Internet access. A small nuance: articles about service workers usually give the following algorithm:

  • In the install event - create a new version of the cache and add all the necessary resources to the cache.
  • In the activate event - delete all versions of the cache except the current one.
  • In the fetch event - first we try to take the resource from the cache, and if we did not find it, we send a network request, the result of which is added to the cache.

In practice, this scheme is inconvenient for two reasons. First, the worker's code must contain an up-to-date list of all required resources, and in large projects using third-party libraries, good luck keeping track of all the transitive imports (including dynamic ones). Second, whenever any file changes you have to bump the service worker version, which installs a new worker and invalidates the previous one, and this happens ONLY when the browser is closed and reopened. A simple page refresh will not help: the old worker with the old cache keeps working. And where is the guarantee that my users will not keep the browser tab open forever? Therefore, we first make the network request, add the result to the cache asynchronously (without waiting for the cache.put(ev.request, resp.clone()) promise to resolve), and if the network is unavailable, take the resource from the cache. "Better to lose a day, then fly there in five minutes" ©.
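This network-first strategy is easier to unit-test if it is factored out as a plain function taking the cache and fetch as parameters; in the real service worker you would call it from the 'fetch' handler via ev.respondWith(networkFirst(ev.request, cache)). A sketch, with names of my own choosing:

```javascript
// Network-first with cache fallback, as a testable plain function.
// `cache` is a Cache-like object (put/match); `fetchFn` defaults to the
// global fetch in a real service worker.
async function networkFirst(request, cache, fetchFn = fetch) {
  try {
    const resp = await fetchFn(request)
    // Cache asynchronously: we deliberately do not await the put() promise
    cache.put(request, resp.clone())
    return resp
  } catch (_) {
    // Network unavailable: fall back to the last cached copy, if any
    return cache.match(request)
  }
}
```

One consequence of this scheme worth noting: a freshly deployed file version reaches the user on the request after next, since the current request is served from the network but cached for later.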

Unresolved issues


  1. On some mobile phones the neural network is slow; perhaps COCO-SSD is not the best choice for my case, but I am not an ML expert and took the first model I had heard of.
  2. I did not find an example of how to send video via GAPI in its original binary form rather than Base64. This would save both processor time and network traffic.
  3. I have not fully figured out security. For local debugging I added the localhost domain to the Google application, but if someone starts using the application keys to send spam, will Google block the keys themselves or the sender's account?

I would be grateful for the feedback.

The source code is on GitHub.

Thank you for your attention.
