Implementing WebRTC in a media server - practice and politics

1. Streaming to browsers in real time - no solution. Or is there?

For about 20 years now, network bandwidth and the computing power of ordinary computers have made it possible to compress and broadcast sound and video over IP in near real time. During this time, hundreds of standards and protocols for efficiently compressing, packaging, forwarding, synchronizing and playing audio-video content on computers and mobile devices have been developed by central standardization bodies such as the W3C and IETF, as well as by many companies large and small. Real-time video capture, compression and broadcasting over IP received special attention because, firstly, IP is the cheapest and most accessible transport at every level, and secondly, video conferencing and video surveillance technologies are vital and in great demand.

It would seem that so many years have passed and so much work has been done. What wonderful achievements can we observe in this area after 20 years? Let's lift the lid of the box (this is, of course, neither Pandora's box nor a can of worms) and see what marvelous technologies and capabilities have become available after so many years of work by tens of thousands of talented software engineers. The programmer who first sent sound over a network back in 1998, the doctor who wants a simple, cheap and reliable telemedicine solution, the teacher who needs to conduct a remote lesson - they now open this lid, full of bright hopes, and what do they see? In a foul-smelling, boiling pot of mind-blowing marketing, cynical capitalism and desperate attempts by enthusiasts to improve things, there float all sorts of codecs, protocols, formats and applications. This is the real-time soup that the IT "community" offers the consumer. Fish out whatever smells the least, try it, test it, buy it. There is no simple and effective solution. This is unlike streaming that does not require real time: there, for about 5 years now, the HLS standard has worked in every browser and on every device, and a solution provider can simply install an HLS segmenter on the server and sleep peacefully.

Here is RTSP - plenty of set-top boxes and professional equipment play it, but browsers do not. Here is RTMP - Safari refuses to play it on iOS, not every Android plays it, and Chrome blocks the Flash plugin it depends on as untrustworthy. Here is MPEG2-TS - browsers do not play it either. HTML5 Media Source Extensions (MSE) are fine for video segments 5-10 seconds long (i.e. for HLS / DASH), but with short segments of less than a second they are not always stable, behave differently in different browsers and, again, are not supported on iOS.

One wonders: how does a kindergarten send video from the cameras installed in its rooms to parents who want to open a browser at any time, on any device, and watch their children in real time without installing any plug-ins? Why don't all kindergartens offer such a service? Because providing it is very expensive: you have to develop mobile apps in which the video will play, since browsers will not play it, and much more besides.

Let's define the notion of "close to real time": less than 5 seconds of delay for video surveillance and less than 1 second for video conferencing. The average delay of the HLS protocol is 20-30 seconds. Perhaps that is somehow acceptable for kindergartens, but for security surveillance, video conferencing and webinars another technology is needed.

So, until now - or more precisely, until the summer of 2017 - there was no single standard or protocol for broadcasting audio-video to any browser on any device in real time. The reasons for this situation will be considered later in this article; they are not technical in nature. In the meantime, let's see what happened in the summer of 2017 that finally, after a fashion, gave us a technology capable of solving the problems described above. That technology is WebRTC; much has been written about it both on this resource and on the web in general. It can no longer be called brand new, and at the time of this writing the W3C considers WebRTC 1.0 a completed project. We will not explain here what WebRTC is; if the reader is not familiar with the technology, we suggest searching Habr or Google to learn what it is used for and how it works in general terms. Here we will only say that the technology was designed for peer-to-peer communication between browsers: with it you can build video chat and voice applications without any server at all, one browser talking directly to another. WebRTC is supported by all browsers on all devices, and in the summer of 2017 Apple finally came around and added it to Safari on iOS. It was this event that made WebRTC the most universal and generally accepted technology for real-time streaming to browsers since the decline of RTMP, which began in 2015.

But what does streaming from cameras to browsers have to do with this? The point is that WebRTC is functionally very flexible: it allows one of the two participants (peers) to only send audio-video and the other to only receive it. Hence the idea of adapting WebRTC for media servers. A media server can receive video from a camera, establish a connection with a browser and agree that only the server will send while the browser only receives. In this way the media server can simultaneously send the camera's video to many browsers / viewers. Conversely, a media server can receive a stream from a browser and forward it to, say, many other browsers, implementing the much-desired "one-to-many" function.
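To make the idea concrete, here is a minimal, hypothetical C++ sketch of the fan-out logic only. The OutboundPeer interface and the class and method names are invented for illustration and are not the API of any particular media server; all the real WebRTC machinery (signaling, ICE, DTLS, SRTP) is assumed to live behind that interface.

#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical interface of one outbound (send-only) WebRTC peer connection.
// A real server would hide its ICE/DTLS/SRTP stack behind something like this.
struct OutboundPeer {
    virtual ~OutboundPeer() = default;
    virtual void SendVideoFrame(const uint8_t* data, size_t size, uint32_t rtpTimestamp) = 0;
};

// One camera stream fanned out to many viewers: the "one-to-many" case.
class CameraRelay {
public:
    void AddViewer(std::shared_ptr<OutboundPeer> peer) {
        std::lock_guard<std::mutex> lock(mutex_);
        viewers_.push_back(std::move(peer));
    }

    // Called whenever a compressed frame (e.g. H.264) arrives from the camera.
    // No decoding or re-encoding: the same frame is forwarded to every viewer.
    void OnCameraFrame(const uint8_t* data, size_t size, uint32_t rtpTimestamp) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto& viewer : viewers_)
            viewer->SendVideoFrame(data, size, rtpTimestamp);
    }

private:
    std::mutex mutex_;
    std::vector<std::shared_ptr<OutboundPeer>> viewers_;
};

A relay of this kind forwards the camera's compressed frames as-is, which is exactly why no encoding or decoding is needed on the server.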

So, has everything finally fallen into place? Hakuna matata: the kindergarten can install such a media server somewhere on a hosting provider or on AWS, send one stream there from each camera, and from there it will be distributed to the parents' browsers, all with a delay of no more than a second. In general, yes - life is getting better. But there are problems, and they stem from the fact that WebRTC was, so to speak, stretched to fit these tasks; it was not designed for them and is not entirely suited to them. Besides codec compatibility, the problems are above all with the scalability of such a media server: one server machine can serve 100 parents at a time, but 500 is already a struggle, even though the network allows it. And look at the processor load on a server with 100 connections - it is already close to 90%. How can that be? After all, we are just sending audio and video.

If the same stream is sent over RTMP to a Flash player, one server can easily sustain 2000 simultaneous connections. With WebRTC only 100? Why? There are two reasons. First, the WebRTC protocol is much more computationally expensive: for example, all data is encrypted, and that eats a lot of processor time. Second - and this is what we will examine in more detail - the protocol's implementation by its creator, Google, is extremely inefficient. Google provides the C++ source code of this implementation for adaptation in servers, gateways and other applications that want to support WebRTC: webrtc.org/native-code

2. Google's Native WebRTC API and Media Server Compatibility

Recall that WebRTC was created to transfer audio-video from browser to browser; supporting many simultaneous connections was never one of its goals. Therefore - and not only therefore - the WebRTC implementation in the browser completely ignores the basic principles of designing technical systems: elegance (nothing superfluous), efficiency, high performance. The emphasis was placed on reliability and on coping with errors and extreme network conditions - packet loss, dropped connections, and so on. Which, of course, is good. On closer examination, however, it turns out that this is the only good thing about Google's WebRTC implementation.

Let's look at the main reasons why using Google's WebRTC implementation in media servers is extremely problematic.

2.a The code is 10 times larger than it needs to be, and it is extremely inefficient

This is a verified figure. To get started, you download about 5 gigabytes of code, of which only 500 megabytes are relevant to WebRTC. Then you try to get rid of the unnecessary code. After all, for the needs of a media server you do not need encoding / decoding; the server should only receive content and forward it to everyone. Once you have removed everything you could (and you can remove far less than you would like), you are still left with 100 megabytes of code. That is a monstrous figure, and it is this figure that is 10 times larger than it should be.

At this point many will object: how can encoding / decoding not be needed? What about transcoding from AAC to Opus and back? What about transcoding VP9 -> H264? If you do such transcoding on the server, it will not sustain even 5 simultaneous connections. If transcoding is really necessary, it should be done not by the media server but by a separate program.

But let's return to the problem of bloated code and illustrate it. What do you think is the depth of the function call stack when sending an already compressed video frame? One call to Winsock (on Windows), i.e. to send or sendto (WSASend / WSASendTo)? No, of course some more work has to be done: in the case of WebRTC the frame must be packed into RTP and encrypted, which together gives us the SRTP protocol. The frame also has to be saved in case of packet loss, so that it can be sent again later. How many C++ objects and threads should be involved in all this?

Here is how WebRTC 61 does it:

[screenshot: call stack from the compressed-frame input down to the PacedSender queue]

As you can see from the screenshot, from the moment we feed the compressed frame to WebRTC until it reaches the queue of the PacedSender object, the call stack is 8 (!) levels deep and 7 objects are involved!

Then a separate thread, PacedSender, pulls our frame from the queue and passes it on for further processing:

[screenshot: call stack on the PacedSender thread]

And finally we arrive at step 4, where the already RTP-packed and encrypted frame is placed on the queue to be sent to the network, which is handled by yet another thread. At this point the call stack depth (on the PacedSender thread) is 7, and 3 more new objects are involved. The thread that does the sending will call the final WSASend / WSASendTo after another 3-4 nested function calls and will involve another 3-4 new objects.

So, we have seen 3 threads, each doing a great deal of work. Anyone who has programmed such systems has an idea of how they are usually built and what actually needs to be done here. By our estimate, at least 90% of the objects and code involved are superfluous and violate the principles of object-oriented design.
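For comparison, here is roughly what the essential per-frame work could look like, sketched with the libsrtp library (mentioned again below) and a single UDP send. This is an assumption-laden illustration, not the code of any real server: it presumes the DTLS handshake is already done and an srtp_t session already exists, assumes the whole frame fits into one RTP packet, and omits fragmentation, retransmission buffers and pacing; the udpSend callback is a hypothetical stand-in for a wrapper around WSASendTo / sendto.

// Assumes an established SRTP session (srtp_t) created after the DTLS handshake,
// e.g. with libsrtp 2.x; the header path may differ between libsrtp versions.
#include <srtp2/srtp.h>
#include <cstdint>
#include <cstring>

// Build a 12-byte RTP header in front of the payload, encrypt in place with
// libsrtp, and hand the result to the socket layer. One frame = one packet.
int SendCompressedFrame(srtp_t srtpSession,
                        int (*udpSend)(const uint8_t* packet, int length),
                        const uint8_t* frame, int frameSize,
                        uint16_t seq, uint32_t timestamp, uint32_t ssrc)
{
    uint8_t packet[2048 + SRTP_MAX_TRAILER_LEN];
    if (frameSize < 0 || frameSize + 12 > 2048) return -1;    // keep the example simple

    packet[0] = 0x80;                                  // V=2, no padding/extension/CSRC
    packet[1] = 0x80 | 96;                             // marker bit, dynamic payload type 96
    packet[2] = seq >> 8;                packet[3] = seq & 0xFF;
    packet[4] = timestamp >> 24;         packet[5] = (timestamp >> 16) & 0xFF;
    packet[6] = (timestamp >> 8) & 0xFF; packet[7] = timestamp & 0xFF;
    packet[8] = ssrc >> 24;              packet[9] = (ssrc >> 16) & 0xFF;
    packet[10] = (ssrc >> 8) & 0xFF;     packet[11] = ssrc & 0xFF;
    std::memcpy(packet + 12, frame, frameSize);

    int len = 12 + frameSize;
    if (srtp_protect(srtpSession, packet, &len) != srtp_err_status_ok)
        return -1;                                     // srtp_protect appends the auth tag and updates len
    return udpSend(packet, len);                       // the only system call: one send
}

Even with all the omissions, the contrast is clear: one helper function and one send call, versus a dozen objects spread over three threads.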

2.b 4-5 threads are allocated per connection

To be clear, the number of threads in the example above is in itself reasonable: asynchronous processing is required, nothing should block anything else, so all 3 threads are needed. Altogether one WebRTC PeerConnection allocates 4-5 threads; it could probably be done with 3, but hardly fewer. The problem is that this happens for every connection! A server can likewise keep 3 threads, but they should serve all connections together rather than be allocated to each connection separately. A thread pool is the obvious server-side solution for such tasks.
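As an illustration of the difference, here is a sketch (plain C++11, not tied to any particular server) of such a shared pool: a fixed number of worker threads serve a single task queue for all connections, instead of every PeerConnection owning its own 4-5 threads.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A fixed pool of worker threads shared by ALL connections. Work items
// (e.g. "packetize and send this frame for connection X") are queued and
// picked up by whichever worker is free.
class ThreadPool {
public:
    explicit ThreadPool(size_t workers) {
        for (size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { Run(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    // Any connection, from any thread, posts its work here.
    void Post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
        }
    }
    std::vector<std::thread> threads_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_ = false;
};

With a design like this, each connection posts small work items into the same pool, so the thread count stays constant no matter how many connections the server carries.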

2.c Asynchronous sockets driven by Windows messages

On Windows, Google's WebRTC code uses asynchronous sockets via WSAAsyncSelect. Server programmers know that using the select function on a server is suicide, and WSAAsyncSelect, while it improves matters, does not do so by an order of magnitude. If you want to support hundreds or thousands of connections, Windows offers a better solution than such asynchronous sockets: overlapped sockets and IO completion ports, with notifications delivered to a thread pool that does the actual work.
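Below is a bare-bones sketch of that pattern for a UDP media socket: an overlapped WSARecvFrom whose completions are delivered to an IO completion port serviced by a small pool of worker threads. It is illustrative only - error handling, buffer pooling, multiple outstanding receives and shutdown are omitted, port 5004 is an arbitrary choice, and the program must be linked against ws2_32.lib.

// Windows-only sketch; winsock2.h must be included before windows.h.
#include <winsock2.h>
#include <ws2tcpip.h>
#include <windows.h>
#include <thread>
#include <vector>

// Per-receive context. WSAOVERLAPPED is the first member, so the OVERLAPPED*
// returned by GetQueuedCompletionStatus can be cast back to the context.
struct RecvContext {
    WSAOVERLAPPED overlapped{};
    char buffer[2048];
    WSABUF wsaBuf{sizeof(buffer), buffer};
    sockaddr_in from{};
    int fromLen = sizeof(from);
};

// Post one asynchronous receive; completion goes to the IOCP, not to a window message loop.
void PostRecv(SOCKET sock, RecvContext* ctx) {
    DWORD flags = 0;
    WSARecvFrom(sock, &ctx->wsaBuf, 1, nullptr, &flags,
                reinterpret_cast<sockaddr*>(&ctx->from), &ctx->fromLen,
                &ctx->overlapped, nullptr);
}

// Worker threads from a small pool pull completed receives off the port.
void WorkerLoop(HANDLE iocp, SOCKET sock) {
    for (;;) {
        DWORD bytes = 0; ULONG_PTR key = 0; OVERLAPPED* ov = nullptr;
        if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE) || !ov) continue;
        RecvContext* ctx = reinterpret_cast<RecvContext*>(ov);
        // ... process 'bytes' bytes of ctx->buffer: demux by ctx->from, feed SRTP, etc. ...
        PostRecv(sock, ctx);                           // re-arm the receive
    }
}

int main() {
    WSADATA wsa; WSAStartup(MAKEWORD(2, 2), &wsa);
    SOCKET sock = WSASocketW(AF_INET, SOCK_DGRAM, IPPROTO_UDP, nullptr, 0, WSA_FLAG_OVERLAPPED);
    sockaddr_in addr{}; addr.sin_family = AF_INET; addr.sin_port = htons(5004);  // arbitrary port
    bind(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 0);   // create the port
    CreateIoCompletionPort(reinterpret_cast<HANDLE>(sock), iocp, 0, 0);          // associate the socket

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)                        // small fixed pool shared by all traffic
        workers.emplace_back(WorkerLoop, iocp, sock);

    PostRecv(sock, new RecvContext());                 // prime the first receive
    for (auto& t : workers) t.join();
}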

2.d Conclusion

So we can conclude: Google's WebRTC code can be applied to a media server without major changes, but such a server will not be able to handle hundreds of simultaneous connections. There are two possible ways out:

Make serious changes to Google's code. This is, without exaggeration, close to impossible: all these objects are very tightly coupled to each other, do not encapsulate their functionality, and are not independent blocks each performing a well-defined job, as they should be. Reusing them unchanged in other scenarios is impossible.

Do not use Google's code at all, and implement everything yourself on top of open libraries such as libsrtp and the like. This is perhaps the right way, but apart from being a huge amount of work, you may well find that your implementation is not fully compatible with Google's and therefore does not work, or does not work in all cases, with, for example, Chrome - which cannot be tolerated. You can then argue with the folks at Google for a long time, prove that you followed the standard and they did not, and you will be right a thousand times over. But at best they will say: "we'll fix it, maybe, sometime later." You have to adapt to Chrome right now. Period.

3. Why is everything so sad

This situation with real-time streaming to browsers is a very characteristic illustration of what "business-driven technology" sometimes leads to. Technology driven by business develops in the direction the business needs and only insofar as it pleases that business. It is thanks to the business approach that we now have personal computers and mobile phones: no government or central planning ministry could ever be interested enough to develop and bring all these consumer technologies to the masses. Private business, motivated by the personal gain of its owners, did it as soon as the technical opportunity arose.

It has long been known, understood and accepted that non-essential consumer goods and services - those you can live without - are better developed by private business, while the things vitally necessary to people - energy, roads, the police, school education - should be developed by centralized, state-controlled institutions.

We, children of the Soviet Union and of the mentality "let's build technically correct and solid technology so that people can use it and all will be well," could of course claim that in a planned Soviet system (had the government suddenly decided to) real-time IP streaming technology could have been developed and deployed within a year, and would have been an order of magnitude better than what business has produced in 20 years. But we also understand that it would then have stopped developing, become obsolete and, in the long run, still lost out to some commercial Western technology.

Therefore, since one can perfectly well get along without streaming, it has rightly been left to private business. Which develops it in its own interests, not in the interests of the consumer. How can it not be in the interests of the consumer - what about supply and demand? Whatever the consumer needs, business will offer, won't it? But it does not. All consumers are shouting: Google, support AAC audio in WebRTC - yet Google will never do it, although it would cost it next to nothing. Apple could not care less and implements none of the badly needed streaming technologies in its gadgets. Why? Because business does not always do what the consumer needs. It does not do so when it is a monopolist and is not afraid of losing the consumer; then the business is busy strengthening its position. Thus Google has bought up a number of audio codec manufacturers in recent years, and now it pushes Opus audio and forces the whole world to transcode AAC -> Opus to be compatible with WebRTC, even though the industry long ago switched to AAC audio. Google justifies this by claiming that AAC is a paid technology while Opus is free, but in reality it is done to establish its own technology as a standard - as Apple once did with its wretched HLS, which we were made to love, and as Adobe did even earlier with its sloppy RTMP protocol. Gadgets and browsers are still technically difficult things to develop; that is where monopolists come from, and that, as they say, is where this problem grows from. And the W3C and IETF are sponsored by the same monopolists, so the mentality of "let's build technically correct and solid technology so that people can use it and all will be well" is not there and never will be. But it should have been.

What is the way out of this situation? Apparently, just waiting until the "right" business-driven technology - the fruit of competition and other wonderful things - finally produces something democratic, suitable for an ordinary rural doctor so that he can provide telemedicine services over his ordinary Internet connection. To be fair, one correction is needed: for those who can pay big money (not for the ordinary rural doctor), business has long offered real-time streaming solutions - good, reliable, requiring dedicated networks and special equipment, and in many cases not even running over IP. Which - and this is another reason for the sad state of affairs - was not created for real time and does not always guarantee it. Not always, but outside of life-critical situations, it is quite suitable for now. So let's try WebRTC: of all the evils it is, for now, the least and the most democratic. For which, after all, we should thank Google.

4. A bit about media servers that implement WebRTC

Wowza, Flashphoner, Kurento, Flussonic, Red5 Pro, Unreal Media Server - these are some of the media servers that support WebRTC. They allow video to be published from browsers to the server and broadcast from the server to browsers via WebRTC.

The problems described in this article are solved in these products in different ways and with varying degrees of success. Some of them, for example Kurento and Wowza, transcode audio-video directly in the server; others, for example Unreal Media Server, do not transcode themselves but delegate this to other programs. Some servers, such as Wowza and Unreal Media Server, support streaming for all connections through one central TCP or UDP port, whereas WebRTC itself allocates a separate port for each connection, forcing the provider to open many ports in the firewall, which creates security problems.
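The central-port approach can be pictured with a hypothetical sketch like the one below (class and method names are invented here; the servers listed above each implement the details differently): all peers share one server UDP port, and each incoming datagram is routed to the right session by its source address, learned during the ICE handshake.

#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>

// One peer's session state; DTLS / STUN / SRTP handling for that peer would live here.
struct PeerSession {
    void HandlePacket(const uint8_t* /*data*/, size_t /*size*/) {
        // process the datagram for this particular peer
    }
};

class SinglePortDemux {
public:
    using Endpoint = std::pair<uint32_t, uint16_t>;    // remote IPv4 address + remote port

    // Called once the peer's transport address is known (e.g. from ICE).
    void Register(const Endpoint& remote, std::shared_ptr<PeerSession> session) {
        sessions_[remote] = std::move(session);
    }

    // Called for every datagram received on the single shared UDP port.
    void OnDatagram(const Endpoint& remote, const uint8_t* data, size_t size) {
        auto it = sessions_.find(remote);
        if (it != sessions_.end())
            it->second->HandlePacket(data, size);
        // else: unknown source - possibly a new ICE binding request to be answered
    }

private:
    std::map<Endpoint, std::shared_ptr<PeerSession>> sessions_;
};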

There are many other points and subtleties implemented differently in each of these servers. How well all this suits the consumer is for you, dear users, to judge.
