FFmpeg libav manual


For a long time I searched for a book that would explain, in plain terms, how to use FFmpeg as a library, also known as libav (the name stands for library audio video). I found the tutorial "How to Write a Video Player in Less than 1000 Lines". Unfortunately, the information there is outdated, so I had to create a manual of my own.

Most of the code will be in C, but do not worry: you will easily understand it and can apply it in your favorite language. FFmpeg libav has bindings for many languages (including Python and Go). And even if your language has no direct bindings, you can still attach to it via FFI (here is an example with Lua).

Let's start with a short digression on what video, audio, codecs and containers are. Then we move on to a crash course on the FFmpeg command line, and finally we write some code. Feel free to jump straight to the section "The thorny path of learning FFmpeg libav".

There is an opinion (and not only mine) that streaming Internet video has already taken the baton from traditional television. Be that as it may, FFmpeg libav is definitely worth exploring.



Video is what you see! ↑


If a sequence of images is shown at a given frequency (say, 24 images per second), an illusion of movement is created. This is the basic idea behind video: a series of images (frames) displayed at a given rate.

1886 illustration.

Audio is what you hear! ↑


Although silent video can cause a wide variety of feelings, adding sound dramatically increases the degree of pleasure.

Sound is vibrational waves propagating in air or in any other transmission medium (such as gas, liquid or solid).

In a digital audio system, a microphone converts sound into an analog electrical signal. Then, an analog-to-digital converter ( ADC ) - usually using pulse-code modulation ( PCM ) - converts an analog signal to a digital one.
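To make the idea of PCM concrete, here is a tiny standalone C sketch (purely illustrative, not part of the original article) that synthesizes one second of a 440 Hz tone as 16-bit samples and writes the raw bytes to a file; compile with -lm, and the comment shows one way to listen to the result, assuming ffplay is installed.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const int sample_rate = 44100;                 // samples per second
    const double pi = 3.14159265358979323846;
    const double freq = 440.0;                     // an "A" note
    static int16_t samples[44100];                 // one second of mono audio

    for (int n = 0; n < sample_rate; n++) {
        double t = (double)n / sample_rate;
        samples[n] = (int16_t)(32767.0 * sin(2.0 * pi * freq * t));  // quantize each sample to 16 bits
    }

    // raw PCM has no header, just samples; one way to listen to it:
    //   ffplay -f s16le -ar 44100 -ac 1 tone.pcm
    FILE *f = fopen("tone.pcm", "wb");
    fwrite(samples, sizeof(int16_t), sample_rate, f);
    fclose(f);
    return 0;
}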


Codec - data compression ↑


A codec is an electronic circuit or software that compresses or decompresses digital audio / video. It converts raw (uncompressed) digital audio / video into a compressed format (or vice versa).

But if we decide to pack millions of images into one file and call it a movie, we can get a huge file. Let's calculate:

Let's say we create a video with a resolution of 1080 Γ— 1920 (height Γ— width). We spend 3 bytes per pixel (the minimal point on a screen) to encode the color (24-bit color, which gives us 16,777,216 different colors). The video runs at 24 frames per second and is 30 minutes long.

toppf = 1080 * 1920 // total of pixels per frame
cpp = 3 // cost in bytes per pixel
tis = 30 * 60 // time in seconds
fps = 24 // frames per second

required_storage = tis * fps * toppf * cpp

This video would require approximately 250.28 GB of storage, or a data rate of about 1.11 Gbit/s! That's why we need to use a codec.
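For reference, here is the same back-of-the-envelope arithmetic as a small C program (an illustration, not from the original); it uses 1024-based prefixes, which is how the figures above were obtained.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t toppf = 1080 * 1920;              // total of pixels per frame
    uint64_t cpp   = 3;                        // cost in bytes per pixel
    uint64_t tis   = 30 * 60;                  // time in seconds
    uint64_t fps   = 24;                       // frames per second

    uint64_t required_storage = tis * fps * toppf * cpp;          // bytes
    double   bitrate          = (double)toppf * cpp * fps * 8;    // bits per second

    printf("storage: %.2f GB\n", required_storage / (1024.0 * 1024 * 1024));  // ~250.28
    printf("bitrate: %.2f Gbit/s\n", bitrate / (1024.0 * 1024 * 1024));       // ~1.11
    return 0;
}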

A container is a convenient way to store audio / video ↑


The container (wrapper) format is a metafile format whose specification describes how various data and metadata elements coexist in a computer file.

This is a single file containing all streams (mainly audio and video), providing synchronization, containing common metadata (such as title, resolution), etc.

Typically, the file format is determined by its extension: for example, video.webm is most likely video using the webm container.


The FFmpeg command line ↑


Self-contained cross-platform solution for recording, converting and streaming audio / video.

For working with multimedia, we have an amazing tool - a library called FFmpeg . Even if you don’t use it in your program code, you still use it (are you using Chrome?).

The project ships a command-line program called ffmpeg (in lowercase, unlike the name of the project itself). It is a simple yet powerful binary. For example, you can convert from mp4 to avi just by typing this command:

$ ffmpeg -i input.mp4 output.avi

We have just remuxed, i.e. converted from one container to another. Technically, FFmpeg can also transcode, but more on that later.

Command line tool FFmpeg 101 ↑


FFmpeg has documentation that does a great job of explaining how everything works.

Schematically, the FFmpeg command-line program expects its arguments in the following format: ffmpeg {1} {2} -i {3} {4} {5}, where:

{1} - global options
{2} - input file options
{3} - input URL
{4} - output file options
{5} - output URL

Parts {2}, {3}, {4} and {5} can specify as many arguments as needed. It is easier to understand the argument format with an example:

WARNING: the file at the link below weighs about 300 MB

$ wget -O bunny_1080p_60fps.mp4 http://distribution.bbb3d.renderfarming.net/video/mp4/bbb_sunflower_1080p_60fps_normal.mp4

$ ffmpeg \
-y \ # global options
-c:a libfdk_aac -c:v libx264 \ # input options
-i bunny_1080p_60fps.mp4 \ # input URL
-c:v libvpx-vp9 -c:a libvorbis \ # output options
bunny_1080p_60fps_vp9.webm # output URL

This command takes an incoming mp4 file containing two streams (audio encoded using the aac codec, and video encoded using the h264 codec), and converts it to webm, changing also the audio and video codecs.

If you simplify the command above, keep in mind that FFmpeg will then assume the default values for you. For example, if you simply type

ffmpeg -i input.avi output.mp4

which audio / video codec does it use to create output.mp4?

Werner Robitza wrote a must-read / must-execute guide about encoding and editing with FFmpeg.

Basic video operations ↑


When working with audio / video, we usually perform a number of tasks related to multimedia.

Transcoding ↑




What is it? Converting one of the streams (audio or video, or both at the same time) from one codec to another. The file format (container) does not change.

For what? It happens that some devices (TVs, smartphones, consoles, etc.) do not support the audio / video format X, but support the audio / video format Y. Or, newer codecs are preferable because they provide a better compression ratio.

How? Convert, for example, video H264 (AVC) to H265 (HEVC):

$ ffmpeg \
-i bunny_1080p_60fps.mp4 \
-c:v libx265 \
bunny_1080p_60fps_h265.mp4

Transmultiplexing ↑



What is it? Convert from one format (container) to another.

For what? Sometimes certain devices (TVs, smartphones, consoles, etc.) do not support file format X but do support file format Y. Or newer containers provide modern features that older ones lack.

How? Convert a mp4 to webm:

$ ffmpeg \
-i bunny_1080p_60fps.mp4 \
-c copy \ # just saying to ffmpeg to skip encoding
bunny_1080p_60fps.webm

Transrating ↑



What is it? Changing the bit rate, or producing other renditions.

For what? The user can watch your video both on a 2G network on a low-power smartphone, and via fiber-optic Internet connection on a 4K TV. Therefore, you should offer more than one option for playing the same video with different data rates.

How? Produce a rendition with a bit rate constrained between 964K and 3856K:

$ ffmpeg \
-i bunny_1080p_60fps.mp4 \
-minrate 964K -maxrate 3856K -bufsize 2000K \
bunny_1080p_60fps_transrating_964_3856.mp4

Typically, transrating is done together with transsizing (resizing). Werner Robitza wrote another must-read article about FFmpeg rate control.

Transsizing (resizing) ↑



What is it? Changing the resolution. As stated above, transsizing is often done together with transrating.

For what? For the same reasons as with transrating.

How? Reduce the resolution from 1080p to 480p:

$ ffmpeg \
-i bunny_1080p_60fps.mp4 \
-vf scale=480:-1 \
bunny_1080p_60fps_transsizing_480.mp4

Bonus: adaptive streaming ↑



What is it? Producing many renditions (with different resolutions and bit rates), splitting the media into chunks, and serving them over the HTTP protocol.

For what? To provide flexible media that can be watched on anything from a budget smartphone to a 4K TV; it is also easy to scale and deploy (though this can add latency).

How? Create responsive WebM using DASH:

# video streams
$ ffmpeg -i bunny_1080p_60fps.mp4 -c:v libvpx-vp9 -s 160x90 -b:v 250k -keyint_min 150 -g 150 -an -f webm -dash 1 video_160x90_250k.webm

$ ffmpeg -i bunny_1080p_60fps.mp4 -c:v libvpx-vp9 -s 320x180 -b:v 500k -keyint_min 150 -g 150 -an -f webm -dash 1 video_320x180_500k.webm

$ ffmpeg -i bunny_1080p_60fps.mp4 -c:v libvpx-vp9 -s 640x360 -b:v 750k -keyint_min 150 -g 150 -an -f webm -dash 1 video_640x360_750k.webm

$ ffmpeg -i bunny_1080p_60fps.mp4 -c:v libvpx-vp9 -s 640x360 -b:v 1000k -keyint_min 150 -g 150 -an -f webm -dash 1 video_640x360_1000k.webm

$ ffmpeg -i bunny_1080p_60fps.mp4 -c:v libvpx-vp9 -s 1280x720 -b:v 1500k -keyint_min 150 -g 150 -an -f webm -dash 1 video_1280x720_1500k.webm

# audio streams
$ ffmpeg -i bunny_1080p_60fps.mp4 -c:a libvorbis -b:a 128k -vn -f webm -dash 1 audio_128k.webm

# the DASH manifest
$ ffmpeg \
 -f webm_dash_manifest -i video_160x90_250k.webm \
 -f webm_dash_manifest -i video_320x180_500k.webm \
 -f webm_dash_manifest -i video_640x360_750k.webm \
 -f webm_dash_manifest -i video_640x360_1000k.webm \
 -f webm_dash_manifest -i video_1280x720_1500k.webm \
 -f webm_dash_manifest -i audio_128k.webm \
 -c copy -map 0 -map 1 -map 2 -map 3 -map 4 -map 5 \
 -f webm_dash_manifest \
 -adaptation_sets "id=0,streams=0,1,2,3,4 id=1,streams=5" \
 manifest.mpd

PS: I pulled this example from the instructions for playing Adaptive WebM using DASH .

Going beyond ↑


There are many other uses for FFmpeg. I use it together with iMovie to create / edit some videos for YouTube. And, of course, nothing prevents you from using it professionally.

The thorny path of learning FFmpeg libav ↑

Don't you wonder sometimes 'bout sound and vision?

David Robert Jones
FFmpeg is extremely useful as a command-line tool for performing essential operations on multimedia files. Can we use it in our programs too?

FFmpeg consists of several libraries that can be integrated into our own programs. Usually, when you install FFmpeg, all of these libraries are automatically installed. I will refer to a set of these libraries as FFmpeg libav .

The title of the section is a tribute to Zed Shaw's Learn [...] the Hard Way series, in particular his book Learn C the Hard Way.

Chapter 0 - The Simple Hello World ↑


In our Hello World we will not actually greet the world in the console. Instead, we will print information about the video: its format (container), duration, resolution and audio channels; and, finally, we will decode some frames and save them as image files.

FFmpeg libav architecture ↑


But before we start writing code, let's look at how the FFmpeg libav architecture works in general and how its components interact with each other.

Here is a diagram of the video decoding process:

First, the media file is loaded into a component called AVFormatContext (the video container is also known as the format). It does not actually load the whole file: often only the header is read.

Once we have loaded the minimal header of our container, we can access its streams (think of them as elementary audio and video data). Each stream will be available through an AVStream component.

Suppose our video has two streams: audio encoded with the AAC codec and video encoded with the H264 ( AVC ) codec. From each stream we can extract pieces of data called packets, which are loaded into components called AVPacket .

The data inside the packets is still encoded (compressed), and to decode the packets we need to pass them to a specific AVCodec .

AVCodec decodes them into an AVFrame, giving us an uncompressed frame. Note that the terminology and the process are the same for both audio and video streams.
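Before diving into the details, here is a minimal, self-contained preview of that call chain (error handling mostly omitted and the stream loop simplified); the chapter below walks through each of these calls one by one.

// bird's-eye preview of the flow described above: container -> streams -> codec info
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc < 2) { fprintf(stderr, "usage: %s <media file>\n", argv[0]); return 1; }

    AVFormatContext *fmt_ctx = NULL;
    if (avformat_open_input(&fmt_ctx, argv[1], NULL, NULL) < 0) return 1;   // read the header
    if (avformat_find_stream_info(fmt_ctx, NULL) < 0) return 1;             // read a bit of each stream

    printf("format %s, duration %lld us\n",
           fmt_ctx->iformat->long_name, (long long)fmt_ctx->duration);

    for (unsigned i = 0; i < fmt_ctx->nb_streams; i++) {
        AVCodecParameters *par = fmt_ctx->streams[i]->codecpar;
        const AVCodec *dec = avcodec_find_decoder(par->codec_id);            // who can decode this stream?
        printf("stream %u: %s\n", i, dec ? dec->long_name : "unknown codec");
    }

    avformat_close_input(&fmt_ctx);
    return 0;
}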

Requirements ↑


Since there are sometimes problems compiling or running the examples, we will use Docker as our development / runtime environment. We will also use the Big Buck Bunny video, so if you do not have it locally, just run make fetch_small_bunny_video in the console.

Actually, the code ↑


TLDR: show me the code and how to run it:

$ make run_hello

We will omit some details, but don’t worry: the source code is available on github.

We are going to allocate memory for the AVFormatContext component , which will contain information about the format (container).

AVFormatContext *pFormatContext = avformat_alloc_context();

Now we are going to open the file, read its header and fill the AVFormatContext with minimal information about the format (note that codecs are usually not opened at this point). To do this, we use the avformat_open_input function. It expects an AVFormatContext, a file name and two optional arguments: an AVInputFormat (if you pass NULL, FFmpeg will guess the format) and an AVDictionary (the demuxer options).

avformat_open_input(&pFormatContext, filename, NULL, NULL);

You can also print the name of the format and the duration of the media:

printf("Format %s, duration %lld us", pFormatContext->iformat->long_name, pFormatContext->duration);

To access the streams, we need to read data from the media. This is done by the avformat_find_stream_info function. After that, pFormatContext->nb_streams will hold the number of streams, and pFormatContext->streams[i] will give us the i-th stream (an AVStream ).

avformat_find_stream_info(pFormatContext,  NULL);

Let's loop over all the streams:

for(int i = 0; i < pFormatContext->nb_streams; i++) {
  //
}

For each stream, we are going to keep the AVCodecParameters, which describe the properties of the codec used by the i-th stream:

AVCodecParameters *pLocalCodecParameters = pFormatContext->streams[i]->codecpar;


Using the codec properties we can find the matching decoder with the avcodec_find_decoder function: it looks up the registered decoder for the codec id and returns an AVCodec, the component that knows how to encode and decode the stream:

AVCodec *pLocalCodec = avcodec_find_decoder(pLocalCodecParameters->codec_id);

Now we can print the codec information:

// specific for video and audio
if (pLocalCodecParameters->codec_type == AVMEDIA_TYPE_VIDEO) {
  printf("Video Codec: resolution %d x %d", pLocalCodecParameters->width, pLocalCodecParameters->height);
} else if (pLocalCodecParameters->codec_type == AVMEDIA_TYPE_AUDIO) {
  printf("Audio Codec: %d channels, sample rate %d", pLocalCodecParameters->channels, pLocalCodecParameters->sample_rate);
}
// general
printf("\tCodec %s ID %d bit_rate %lld", pLocalCodec->long_name, pLocalCodec->id, pCodecParameters->bit_rate);

Using the codec, we allocate memory for the AVCodecContext, which will hold the context of our decoding / encoding process. We then need to fill this codec context with the codec parameters, which we do with avcodec_parameters_to_context.

Once the codec context is filled, we need to open the codec with avcodec_open2. After that we can use it:

AVCodecContext *pCodecContext = avcodec_alloc_context3(pCodec);
avcodec_parameters_to_context(pCodecContext, pCodecParameters);
avcodec_open2(pCodecContext, pCodec, NULL);

Now we are going to read packets from the stream and decode them into frames, but first we need to allocate memory for both components ( AVPacket and AVFrame ).

AVPacket *pPacket = av_packet_alloc();
AVFrame *pFrame = av_frame_alloc();

Now let's read packets from the streams with the av_read_frame function, while there are packets left:

while(av_read_frame(pFormatContext, pPacket) >= 0) {
  //...
}

Now we will send the raw data packet (compressed frame) to the decoder through the codec context using the avcodec_send_packet function :

avcodec_send_packet(pCodecContext, pPacket);

And let's get a frame of raw data (an uncompressed frame) from the decoder through the same codec context using the avcodec_receive_frame function :

avcodec_receive_frame(pCodecContext, pFrame);
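In practice these two calls are driven together inside the reading loop; here is a sketch of how they are usually combined (video_stream_index is assumed to have been saved while looping over the streams earlier, and the version in the repository adds more error handling):

while (av_read_frame(pFormatContext, pPacket) >= 0) {
  if (pPacket->stream_index != video_stream_index) {   // assumed: the video stream index saved earlier
    av_packet_unref(pPacket);
    continue;
  }
  int response = avcodec_send_packet(pCodecContext, pPacket);
  while (response >= 0) {
    response = avcodec_receive_frame(pCodecContext, pFrame);
    if (response == AVERROR(EAGAIN) || response == AVERROR_EOF)
      break;                 // the decoder needs more input, or we are done
    else if (response < 0)
      break;                 // a real decoding error; handle or log it in real code
    // ... use pFrame here (print its info, save it as an image, etc.) ...
    av_frame_unref(pFrame);
  }
  av_packet_unref(pPacket);
}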

We can print the frame number, the PTS, the DTS, the frame type, etc.:

printf(
    "Frame %c (%d) pts %d dts %d key_frame %d [coded_picture_number %d, display_picture_number %d]",
    av_get_picture_type_char(pFrame->pict_type),
    pCodecContext->frame_number,
    pFrame->pts,
    pFrame->pkt_dts,
    pFrame->key_frame,
    pFrame->coded_picture_number,
    pFrame->display_picture_number
);

And finally, we can save our decoded frame as a simple gray image. The process is very simple: we use pFrame->data, whose plane indexes are associated with Y, Cb and Cr, and just pick plane 0 (Y) to save our grayscale image:

save_gray_frame(pFrame->data[0], pFrame->linesize[0], pFrame->width, pFrame->height, frame_filename);

static void save_gray_frame(unsigned char *buf, int wrap, int xsize, int ysize, char *filename)
{
    FILE *f;
    int i;
    f = fopen(filename,"w");
    // writing the minimal required header for a pgm file format
    // portable graymap format -> https://en.wikipedia.org/wiki/Netpbm_format#PGM_example
    fprintf(f, "P5\n%d %d\n%d\n", xsize, ysize, 255);

    // writing line by line
    for (i = 0; i < ysize; i++)
        fwrite(buf + i * wrap, 1, xsize, f);
    fclose(f);
}

And voila! Now we have a 2MB grayscale image:


Chapter 1 - Sync Audio and Video ↑

Be the player - a young JS developer writing a new MSE video player.
Before we start writing the transcoding code, let's talk about synchronization, or how a video player knows the right time to play a frame.

In the previous example, we saved several frames:


When we design a video player, we need to play each frame at a certain pace, otherwise it is difficult to enjoy the video either because it plays too fast or too slow.

Therefore, we need some logic to play each frame smoothly. Each frame has a presentation timestamp ( PTS ), an increasing number factored in a timebase, which is a rational number (whose denominator is known as the timescale) divisible by the frame rate ( fps ).

It’s easier to understand with examples. Let's simulate some scenarios.

For fps = 60/1 and timebase = 1/60000, each PTS will increase by timescale / fps = 1000, so the real PTS time for each frame could be (assuming it starts at 0):

frame=0, PTS = 0, PTS_TIME = 0
frame=1, PTS = 1000, PTS_TIME = PTS * timebase = 0.016
frame=2, PTS = 2000, PTS_TIME = PTS * timebase = 0.033

Almost the same scenario, but with a timebase of 1/60:

frame=0, PTS = 0, PTS_TIME = 0
frame=1, PTS = 1, PTS_TIME = PTS * timebase = 0.016
frame=2, PTS = 2, PTS_TIME = PTS * timebase = 0.033
frame=3, PTS = 3, PTS_TIME = PTS * timebase = 0.050

For fps = 25/1 and timebase = 1/75, each PTS will increase by timescale / fps = 3, and the PTS time could be:

frame=0, PTS = 0, PTS_TIME = 0
frame=1, PTS = 3, PTS_TIME = PTS * timebase = 0.04
frame=2, PTS = 6, PTS_TIME = PTS * timebase = 0.08
frame=3, PTS = 9, PTS_TIME = PTS * timebase = 0.12
...
frame=24, PTS = 72, PTS_TIME = PTS * timebase = 0.96
...
frame=4064, PTS = 12192, PTS_TIME = PTS * timebase = 162.56

Now with pts_time we have a way to render the frames in sync with the audio's pts_time or with a system clock. FFmpeg libav provides this information through its API:

fps = AVStream->avg_frame_rate
tbr = AVStream->r_frame_rate
tbn = AVStream->time_base
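For example, turning a frame's PTS into seconds is just a multiplication by the stream's time base. A small sketch, reusing the pFormatContext, video_stream_index and pFrame names from Chapter 0 (assumptions, not verbatim repository code); av_q2d converts an AVRational to a double:

AVRational time_base = pFormatContext->streams[video_stream_index]->time_base;
double pts_time = pFrame->pts * av_q2d(time_base);   // pts_time = pts * timebase, in seconds
printf("pts %lld -> %.3f seconds\n", (long long)pFrame->pts, pts_time);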


Just out of curiosity, the frames we saved were sent in DTS order (frames: 1, 6, 4, 2, 3, 5) but played back in PTS order (frames: 1, 2, 3, 4, 5). Also note how much cheaper B-frames are compared to P- or I-frames:

LOG: AVStream->r_frame_rate 60/1
LOG: AVStream->time_base 1/60000
...
LOG: Frame 1 (type=I, size=153797 bytes) pts 6000 key_frame 1 [DTS 0]
LOG: Frame 2 (type=B, size=8117 bytes) pts 7000 key_frame 0 [DTS 3]
LOG: Frame 3 (type=B, size=8226 bytes) pts 8000 key_frame 0 [DTS 4]
LOG: Frame 4 (type=B, size=17699 bytes) pts 9000 key_frame 0 [DTS 2]
LOG: Frame 5 (type=B, size=6253 bytes) pts 10000 key_frame 0 [DTS 5]
LOG: Frame 6 (type=P, size=34992 bytes) pts 11000 key_frame 0 [DTS 1]

Chapter 2 - Remultiplexing ↑


Remultiplexing (remuxing) means moving from one format (container) to another. For example, we can easily replace an MPEG-4 container with MPEG-TS using FFmpeg:

ffmpeg -i input.mp4 -c copy output.ts

The MP4 file will be demultiplexed and multiplexed again into an MPEG-TS file without being decoded or encoded ( -c copy ). If you do not specify the format with -f, ffmpeg will try to guess it from the file extension.

The general use of FFmpeg or libav follows such a pattern / architecture or workflow:

  • protocol level - accepting the input data (for example a file, but it could also be an rtmp or HTTP stream)
  • format level - demultiplexing the content, revealing the metadata and the streams
  • codec level - decoding the compressed stream data
  • pixel level - optionally applying filters to the raw frames (such as resizing)
  • ... and then going the opposite way:
  • codec level - encoding (or re-encoding, or even transcoding) the raw frames
  • format level - multiplexing (or remultiplexing) the encoded streams (the compressed data)
  • protocol level - sending the multiplexed data to an output (another file, or perhaps a remote network server)


(This graph is heavily inspired by the work of Leixiaohua and Slhck )

Now let's create an example using libav to provide the same effect as when executing this command:

ffmpeg -i input.mp4 -c copy output.ts

We are going to read from input ( input_format_context ) and change it to another output ( output_format_context ):

AVFormatContext *input_format_context = NULL;
AVFormatContext *output_format_context = NULL;

Usually, we start by allocating memory and opening the input format. For this specific case, we are going to open the input file and allocate memory for the output file:

if ((ret = avformat_open_input(&input_format_context, in_filename, NULL, NULL)) < 0) {
  fprintf(stderr, "Could not open input file '%s'", in_filename);
  goto end;
}
if ((ret = avformat_find_stream_info(input_format_context, NULL)) < 0) {
  fprintf(stderr, "Failed to retrieve input stream information");
  goto end;
}

avformat_alloc_output_context2(&output_format_context, NULL, NULL, out_filename);
if (!output_format_context) {
  fprintf(stderr, "Could not create output context\n");
  ret = AVERROR_UNKNOWN;
  goto end;
}

We will remultiplex only the video, audio and subtitle streams. Therefore, we record which streams we will use in an array of indexes:

number_of_streams = input_format_context->nb_streams;
streams_list = av_mallocz_array(number_of_streams, sizeof(*streams_list));

Right after allocating the required memory, we loop over all the streams and, for each of them, create a new output stream in our output format context using the avformat_new_stream function. Note that we mark all streams that are not video, audio or subtitles so that we can skip them later.

for (i = 0; i < input_format_context->nb_streams; i++) {
  AVStream *out_stream;
  AVStream *in_stream = input_format_context->streams[i];
  AVCodecParameters *in_codecpar = in_stream->codecpar;
  if (in_codecpar->codec_type != AVMEDIA_TYPE_AUDIO &&
      in_codecpar->codec_type != AVMEDIA_TYPE_VIDEO &&
      in_codecpar->codec_type != AVMEDIA_TYPE_SUBTITLE) {
    streams_list[i] = -1;
    continue;
  }
  streams_list[i] = stream_index++;
  out_stream = avformat_new_stream(output_format_context, NULL);
  if (!out_stream) {
    fprintf(stderr, "Failed allocating output stream\n");
    ret = AVERROR_UNKNOWN;
    goto end;
  }
  ret = avcodec_parameters_copy(out_stream->codecpar, in_codecpar);
  if (ret < 0) {
    fprintf(stderr, "Failed to copy codec parameters\n");
    goto end;
  }
}

Now create the output file:

if (!(output_format_context->oformat->flags & AVFMT_NOFILE)) {
  ret = avio_open(&output_format_context->pb, out_filename, AVIO_FLAG_WRITE);
  if (ret < 0) {
    fprintf(stderr, "Could not open output file '%s'", out_filename);
    goto end;
  }
}

ret = avformat_write_header(output_format_context, NULL);
if (ret < 0) {
  fprintf(stderr, "Error occurred when opening output file\n");
  goto end;
}

After that we can copy the streams, packet by packet, from our input to our output streams. This happens in a loop, as long as there are packets ( av_read_frame ); for each packet we need to recalculate the PTS and DTS so that we can finally write it ( av_interleaved_write_frame ) to our output format context.

while (1) {
  AVStream *in_stream, *out_stream;
  ret = av_read_frame(input_format_context, &packet);
  if (ret < 0)
    break;
  in_stream  = input_format_context->streams[packet.stream_index];
  if (packet.stream_index >= number_of_streams || streams_list[packet.stream_index] < 0) {
    av_packet_unref(&packet);
    continue;
  }
  packet.stream_index = streams_list[packet.stream_index];
  out_stream = output_format_context->streams[packet.stream_index];
  /* copy packet */
  packet.pts = av_rescale_q_rnd(packet.pts, in_stream->time_base, out_stream->time_base, AV_ROUND_NEAR_INF|AV_ROUND_PASS_MINMAX);
  packet.dts = av_rescale_q_rnd(packet.dts, in_stream->time_base, out_stream->time_base, AV_ROUND_NEAR_INF|AV_ROUND_PASS_MINMAX);
  packet.duration = av_rescale_q(packet.duration, in_stream->time_base, out_stream->time_base);
  // https://ffmpeg.org/doxygen/trunk/structAVPacket.html#ab5793d8195cf4789dfb3913b7a693903
  packet.pos = -1;

  //https://ffmpeg.org/doxygen/trunk/group__lavf__encoding.html#ga37352ed2c63493c38219d935e71db6c1
  ret = av_interleaved_write_frame(output_format_context, &packet);
  if (ret < 0) {
    fprintf(stderr, "Error muxing packet\n");
    break;
  }
  av_packet_unref(&packet);
}

To complete, we need to write the stream trailer to the output media file using the av_write_trailer function :

av_write_trailer(output_format_context);

Now we are ready to test the code. The first test will be converting the format (video container) from MP4 to an MPEG-TS video file. Basically it recreates the command line ffmpeg -i input.mp4 -c copy output.ts using libav.

make run_remuxing_ts

It works! Don't believe me?! Check with ffprobe:

ffprobe -i remuxed_small_bunny_1080p_60fps.ts

Input #0, mpegts, from 'remuxed_small_bunny_1080p_60fps.ts':
  Duration: 00:00:10.03, start: 0.000000, bitrate: 2751 kb/s
  Program 1
    Metadata:
      service_name    : Service01
      service_provider: FFmpeg
    Stream #0:0[0x100]: Video: h264 (High) ([27][0][0][0] / 0x001B), yuv420p(progressive), 1920x1080 [SAR 1:1 DAR 16:9], 60 fps, 60 tbr, 90k tbn, 120 tbc
    Stream #0:1[0x101]: Audio: ac3 ([129][0][0][0] / 0x0081), 48000 Hz, 5.1(side), fltp, 320 kb/s

To sum up what we have done, we can now go back to our original picture of how libav works; note, however, that we skipped the codec part, which is shown in the diagram.


Before we finish this chapter, I would like to show an important part of the remuxing process: you can pass options to the muxer. Suppose you want to deliver the MPEG-DASH format; you would then need to use fragmented mp4 (sometimes called fmp4 ) instead of MPEG-TS or plain MPEG-4.

Using the command line is easy:

ffmpeg -i non_fragmented.mp4 -movflags frag_keyframe+empty_moov+default_base_moof fragmented.mp4

It is almost as simple with libav: we just pass the options when writing the output header, right before copying the packets:

AVDictionary* opts = NULL;
av_dict_set(&opts, "movflags", "frag_keyframe+empty_moov+default_base_moof", 0);
ret = avformat_write_header(output_format_context, &opts);
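Since avformat_write_header consumes the options it recognizes and leaves the rest in the dictionary, a reasonable follow-up (a sketch, not part of the original snippet) is to warn about leftovers and free the dictionary:

AVDictionaryEntry *entry = NULL;
// any entry still present was not recognized by the muxer
while ((entry = av_dict_get(opts, "", entry, AV_DICT_IGNORE_SUFFIX)))
  fprintf(stderr, "Option %s not recognized by the muxer\n", entry->key);
av_dict_free(&opts);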

Now we can generate this fragmented mp4 file:

make run_remuxing_fragmented_mp4

To check that everything is as it should be, you can use the amazing gpac/mp4box.js tool or the site http://mp4parser.com/ to see the differences; first load the regular (non-fragmented) mp4.

As you can see, it has a single, indivisible mdat box: this is where the video and audio frames live. Now load the fragmented mp4 to see how it spreads the mdat boxes out:

Chapter 3 - Transcoding ↑


TLDR show me the code and execution:

$ make run_transcoding

We will skip some details, but don’t worry: the source code is available on github.

In this chapter, we will create a minimalist transcoder written in C, which can convert videos from H264 to H265 using the FFmpeg libav libraries, in particular libavcodec , libavformat and libavutil .


AVFormatContext is an abstraction for the format of the media file, i.e. for the container (MKV, MP4, WebM, TS).
AVStream represents each type of data for a given format (for example: audio, video, subtitles, metadata).
AVPacket is a slice of compressed data obtained from an AVStream that can be decoded by an AVCodec (for example av1, h264, vp9, hevc), generating raw data called an AVFrame .

Transmultiplexing ↑


Let's start with the simple transmuxing operation: first we load the input file.

// Allocate an AVFormatContext.
AVFormatContext *avfc = avformat_alloc_context();
// Open an input stream and read the header.
avformat_open_input(&avfc, in_filename, NULL, NULL);
// Read packets of a media file to get stream information.
avformat_find_stream_info(avfc, NULL);

Now we configure the decoder. The AVFormatContext gives us access to all the AVStream components, and for each of them we can get its AVCodec and create a specific AVCodecContext. Finally, we open the codec so we can move on to the decoding process.

AVCodecContext contains media configuration data, such as data rate, frame rate, sample rate, channels, pitch, and many others.

for (int i = 0; i < avfc->nb_streams; i++) {
  AVStream *avs = avfc->streams[i];
  AVCodec *avc = avcodec_find_decoder(avs->codecpar->codec_id);
  AVCodecContext *avcc = avcodec_alloc_context3(avc);
  avcodec_parameters_to_context(avcc, avs->codecpar);
  avcodec_open2(avcc, avc, NULL);
}

You also need to prepare the output media file for conversion. First, allocate memory for the output AVFormatContext . Create each stream in the output format. To properly pack the stream, copy the codec parameters from the decoder.

Set the AV_CODEC_FLAG_GLOBAL_HEADER flag, which tells the encoder that it can use global headers; finally, open the output file for writing and write the header:

avformat_alloc_output_context2(&encoder_avfc, NULL, NULL, out_filename);

AVStream *avs = avformat_new_stream(encoder_avfc, NULL);
avcodec_parameters_copy(avs->codecpar, decoder_avs->codecpar);

if (encoder_avfc->oformat->flags & AVFMT_GLOBALHEADER)
  encoder_avfc->flags |= AV_CODEC_FLAG_GLOBAL_HEADER;

avio_open(&encoder_avfc->pb, out_filename, AVIO_FLAG_WRITE);
avformat_write_header(encoder_avfc, &muxer_opts);

We get the AVPacket from the decoder, adjust the timestamps and write the packet to the output file. Even though the av_interleaved_write_frame function says "write frame", we are actually storing the packet. We finish the remuxing process by writing the stream trailer to the file.

AVFrame *input_frame = av_frame_alloc();
AVPacket *input_packet = av_packet_alloc();

while (av_read_frame(decoder_avfc, input_packet) >= 0) {
  av_packet_rescale_ts(input_packet, decoder_video_avs->time_base, encoder_video_avs->time_base);
  av_interleaved_write_frame(encoder_avfc, input_packet);
  av_packet_unref(input_packet);
}

av_write_trailer(encoder_avfc);

Transcoding ↑


In the previous section we had a simple program for remuxing; now we will add the ability to encode files, in particular to transcode video from h264 to h265 .

After the decoder is prepared, but before organizing the output media file, configure the encoder.

  • Create a video AVStream in the encoder with avformat_new_stream .
  • Use the AVCodec called libx265 , found via avcodec_find_encoder_by_name .
  • Create an AVCodecContext for the chosen codec with avcodec_alloc_context3 .
  • Set up basic attributes for the transcoding session, and ...
  • ... open the codec and copy the parameters from the context to the stream ( avcodec_open2 and avcodec_parameters_from_context ).

AVRational input_framerate = av_guess_frame_rate(decoder_avfc, decoder_video_avs, NULL);
AVStream *video_avs = avformat_new_stream(encoder_avfc, NULL);

char *codec_name = "libx265";
char *codec_priv_key = "x265-params";
// we're going to use internal options for the x265:
// they disable scene change detection and fix the
// GOP at 60 frames.
char *codec_priv_value = "keyint=60:min-keyint=60:scenecut=0";

AVCodec *video_avc = avcodec_find_encoder_by_name(codec_name);
AVCodecContext *video_avcc = avcodec_alloc_context3(video_avc);
// encoder codec params
av_opt_set(video_avcc->priv_data, codec_priv_key, codec_priv_value, 0);
video_avcc->height = decoder_video_avcc->height;
video_avcc->width = decoder_video_avcc->width;
video_avcc->pix_fmt = video_avc->pix_fmts[0];
// rate control
video_avcc->bit_rate = 2 * 1000 * 1000;
video_avcc->rc_buffer_size = 4 * 1000 * 1000;
video_avcc->rc_max_rate = 2 * 1000 * 1000;
video_avcc->rc_min_rate = 2.5 * 1000 * 1000;
// time base
video_avcc->time_base = av_inv_q(input_framerate);
video_avs->time_base = video_avcc->time_base;

avcodec_open2(video_avcc, video_avc, NULL);
avcodec_parameters_from_context(video_avs->codecpar, video_avcc);

It is necessary to expand the decoding cycle for transcoding a video stream:

  • We send the AVPacket read from the input to the decoder ( avcodec_send_packet ).
  • We receive the uncompressed AVFrame ( avcodec_receive_frame ).
  • We start transcoding this raw frame.
  • We send the raw frame to the encoder ( avcodec_send_frame ).
  • We receive the compressed data as an AVPacket from the encoder ( avcodec_receive_packet ).
  • We set the timestamps ( av_packet_rescale_ts ).
  • We write it to the output file ( av_interleaved_write_frame ).

AVFrame *input_frame = av_frame_alloc();
AVPacket *input_packet = av_packet_alloc();

while (av_read_frame(decoder_avfc, input_packet) >= 0)
{
  int response = avcodec_send_packet(decoder_video_avcc, input_packet);
  while (response >= 0) {
    response = avcodec_receive_frame(decoder_video_avcc, input_frame);
    if (response == AVERROR(EAGAIN) || response == AVERROR_EOF) {
      break;
    } else if (response < 0) {
      return response;
    }
    if (response >= 0) {
      encode(encoder_avfc, decoder_video_avs, encoder_video_avs, encoder_video_avcc, input_frame, input_packet->stream_index);
    }
    av_frame_unref(input_frame);
  }
  av_packet_unref(input_packet);
}
av_write_trailer(encoder_avfc);

// used function
int encode(AVFormatContext *avfc, AVStream *dec_video_avs, AVStream *enc_video_avs, AVCodecContext *video_avcc, AVFrame *input_frame, int index) {
  AVPacket *output_packet = av_packet_alloc();
  int response = avcodec_send_frame(video_avcc, input_frame);

  while (response >= 0) {
    response = avcodec_receive_packet(video_avcc, output_packet);
    if (response == AVERROR(EAGAIN) || response == AVERROR_EOF) {
      break;
    } else if (response < 0) {
      return -1;
    }

    output_packet->stream_index = index;
    output_packet->duration = enc_video_avs->time_base.den / enc_video_avs->time_base.num / dec_video_avs->avg_frame_rate.num * dec_video_avs->avg_frame_rate.den;

    av_packet_rescale_ts(output_packet, dec_video_avs->time_base, enc_video_avs->time_base);
    response = av_interleaved_write_frame(avfc, output_packet);
  }
  av_packet_unref(output_packet);
  av_packet_free(&output_packet);
  return 0;
}

We have converted the media stream from h264 to h265 . As expected, the h265 version of the media file is smaller than the h264 one. The program we created can also produce other variants:

  /*
   * H264 -> H265
   * Audio -> remuxed (untouched)
   * MP4 - MP4
   */
  StreamingParams sp = {0};
  sp.copy_audio = 1;
  sp.copy_video = 0;
  sp.video_codec = "libx265";
  sp.codec_priv_key = "x265-params";
  sp.codec_priv_value = "keyint=60:min-keyint=60:scenecut=0";

  /*
   * H264 -> H264 (fixed gop)
   * Audio -> remuxed (untouched)
   * MP4 - MP4
   */
  StreamingParams sp = {0};
  sp.copy_audio = 1;
  sp.copy_video = 0;
  sp.video_codec = "libx264";
  sp.codec_priv_key = "x264-params";
  sp.codec_priv_value = "keyint=60:min-keyint=60:scenecut=0:force-cfr=1";

  /*
   * H264 -> H264 (fixed gop)
   * Audio -> remuxed (untouched)
   * MP4 - fragmented MP4
   */
  StreamingParams sp = {0};
  sp.copy_audio = 1;
  sp.copy_video = 0;
  sp.video_codec = "libx264";
  sp.codec_priv_key = "x264-params";
  sp.codec_priv_value = "keyint=60:min-keyint=60:scenecut=0:force-cfr=1";
  sp.muxer_opt_key = "movflags";
  sp.muxer_opt_value = "frag_keyframe+empty_moov+default_base_moof";

  /*
   * H264 -> H264 (fixed gop)
   * Audio -> AAC
   * MP4 - MPEG-TS
   */
  StreamingParams sp = {0};
  sp.copy_audio = 0;
  sp.copy_video = 0;
  sp.video_codec = "libx264";
  sp.codec_priv_key = "x264-params";
  sp.codec_priv_value = "keyint=60:min-keyint=60:scenecut=0:force-cfr=1";
  sp.audio_codec = "aac";
  sp.output_extension = ".ts";

  /* WIP :P  -> it's not playing on VLC, the final bit rate is huge
   * H264 -> VP9
   * Audio -> Vorbis
   * MP4 - WebM
   */
  //StreamingParams sp = {0};
  //sp.copy_audio = 0;
  //sp.copy_video = 0;
  //sp.video_codec = "libvpx-vp9";
  //sp.audio_codec = "libvorbis";
  //sp.output_extension = ".webm";

Hand on heart, I confess that it was a little more complicated than it seemed at the beginning. I had to dig into the FFmpeg command-line source code and test a lot. Probably I am missing something somewhere, because I had to use force-cfr for h264 , and some warning messages still pop up, for example that the frame type (5) was forced to frame type (3).
