@arch1t3cht
Last active December 30, 2025 16:05
What Every Typesetter Should Know about Renderer Internals

ASS subtitle rendering is seen by many people as some arcane black magic. That's because it is. Luckily, you do not need to understand most of the black magic to learn many meaningful lessons about both the behavior and performance of renderers.

Let's get it over with right away and post the diagram:

A Venn diagram showing two completely disjoint circles. One circle is labeled "What typesetters think is bad for performance." The other circle is labeled "What's actually bad for performance."

This image has rightfully become a meme in the typesetting community due to the many, many misconceptions that many typesetters have about ASS rendering performance. This article goes into great detail about the inner workings of ASS renderers to clear up as many of these myths as possible. As a result, this article is quite long. If you are looking for a shorter summary, you can read the two sections at the end, but if you do any kind of serious typesetting, I strongly recommend you to read the entire thing when you can.

Definitions

I take some care to be consistent with my terminology here. None of the terms I use should be nonstandard in any way, but I try to only use one synonym of any specific term for one specific context to be less ambiguous. You shouldn't need to memorize these definitions to understand the article, but you can refer to it if you are confused by something.

  • An event or line is a single "Dialogue:" entry in an ASS subtitle file. That is, an event has a style, a layer, margins, an event text, and so on. (I use both event and line since there is no real risk of confusion with line breaks here; event is the more technical term, line the more informal one.)
  • A drawing is a shape that is drawn using the ASS drawing syntax, i.e. {\p1}m 0 0 l 100 0 ...
  • A character is a visible letter of text that is rendered in an ASS Event.
  • A glyph is a shape that a font provides for a certain character (or ligature, etc.).
  • A shape is either a glyph or a (parsed) drawing. That is, a shape consists of a sequence of vertices that should be connected by lines, bezier curves, or other splines. Drawing (or "rasterizing") a shape means filling in its interior, not stroking along its contours, even when drawing outlines.
  • An outline is the outline or border drawn around an ASS event with \bord.
  • A run (or shape run) is a contiguous sequence of characters that have the same style parameters, or a single drawing.
  • Blur always means \blur unless otherwise specified. You should never use \be anyway.
  • A transform is a \t tag. A transformation is a spatial transformation applied to a line, i.e. positioning, scaling, rotation, and shearing or a combination of them. This may be the most nonstandard terminology out of all of these, but I am making the distinction here to avoid some confusion.

On the Different Renderers

The original ASS renderer was guliverkli's VSFilter. This was a DirectShow filter, meaning that it could hook into a video playback stream on Windows and draw subtitles directly onto the YCbCr video.

Since then, multiple variants of VSFilter have appeared, with some differences between them. The three most relevant ones are xy-VSFilter, MPC-HC's internal subtitle renderer (MPC-HC ISR for short), and VSFilterMod.

The former two are intended for live playback of subtitles, and are available in video players like MPC-HC. They only have comparatively minor changes in rendering over the original VSFilter. VSFilterMod, on the other hand, makes big extensions to the format, adding many new tags for effects like distortion, gradients, inserting images, and more. It can do this because it is not intended for live playback (so in particular it does not need to have realtime performance). VSFilterMod can be very useful when burning subtitles onto the video in advance, since it gives subtitle authors many more capabilities, but its extension to the ASS format cannot be used when authoring subtitles that are to be distributed as softsubs.

Since, for the most part, this article is centered around authoring softsubs, VSFilterMod is not relevant for most of it. Still, I am mentioning it here since it certainly has its place when creating hardsubs, and since it is important to understand this distinction between hardsubbing and creating softsubs intended for live playback.

So, back to xy-VSFilter and MPC-HC ISR. Both of these renderers are based on the original VSFilter, and hence use the Windows GDI and DirectWrite APIs for drawing text. This makes them Windows-only, which in turn meant that users on Linux and macOS could not watch videos with ASS subtitles.

libass was created to solve this problem. libass is a cross-platform ASS rendering library that was written completely from scratch, without copying any of VSFilter's code. Unlike the VSFilter variants, libass does not use GDI and DirectWrite for rendering1, so it can be used on all major operating systems. However, this also means that it needs to take great care to emulate GDI's behavior as closely as possible in order for its rendering to match VSFilter's.

Being cross-platform was not the only improvement that libass brought to the ecosystem. Over time, libass was improved with many performance optimizations that made its rendering much faster and more efficient. Today, libass is much faster than the VSFilters in many cases.

Furthermore, libass is still being actively developed, while development on VSFilter is mostly discontinued (with the little development that is being done on the VSFilters when necessary often being coordinated by libass's developers). Today, this makes libass the effective "standard" renderer for most users. libass is used in mpv (which is generally regarded as the best media player around, and the target of many fansub releases), as well as in many other players like VLC, Kodi, or even new versions of MPC-HC (though it is not selected by default).

Now, why do you need to know this? Well, many subtitle authors hear about how libass is the "standard" renderer nowadays, and conclude that "I only need to worry about libass when authoring subtitles." They may then include a note like "This release is intended for playback in mpv" in their releases and attribute any rendering errors on VSFilter to their viewers using the wrong player.

However, this is not correct. While libass is the most popular renderer and most viewers can and should use it, libass's rendering behavior is not authoritative for the format. Neither is the document that is sometimes presented as the "ASS Specification" (and neither are any third-party documents like Aegisub's manual).

While it's unfortunate for all involved parties, the reality is that there is no (authoritative) specification document for the ASS format. The ASS subtitle format is implementation-defined, with the reference implementation being VSFilter. This means that whenever a subtitle file looks different in libass than in VSFilter, it is VSFilter who is correct and libass who is wrong. While it has already gotten very close, libass does not yet match VSFilter's rendering behavior completely. Still, matching VSFilter's rendering is the library's eventual goal. This has an extremely important consequence for subtitle authors: Any behavior of libass that differs from VSFilter can change at any moment, and should hence not be relied on by subtitle authors.

This means that even if you ensure that your viewers will only ever use mpv to view your subtitles, if at the moment your script only renders "correctly" (as in, "how you want it to render") on libass and not on VSFilter, there is no guarantee that your script will keep rendering this way on libass in the future.2

This is something that often causes frustration among libass users, with comments along the lines of "Why break my subtitle file's rendering and stagnate the format just to make some 20-year old subtitles that nobody cares about render correctly?" This is easy to think, but it's important to realize that there exist old subtitle files that currently cannot be correctly played back outside of Windows, and that rectifying this situation is part of libass's goal. Moreover, the alternative isn't any better either: If we were to stop respecting VSFilter as the reference implementation, that would mean that no reference exists at all, and that libass is free to make up whatever behavior it likes at any moment. Unless it were to precisely specify every single edge case of its rendering (and Hyrum's law suggests that this may be a lost cause), any behavior that you might rely on could be deemed as "unintended" or "a bug" by libass in the future and changed. With VSFilter being the reference implementation, there is at least a reference implementation, and any libass behavior that matches VSFilter can be relied on to not change in the future (of course there could still be rendering regressions, but these can then be objectively deemed as bugs and be fixed).

The bright side of this story is that this only applies to rendering, not to performance. If you are sure that your viewers (or at least the viewers that you care about) will only use libass for rendering, you're free to rely on libass's performance and author subtitles that may stutter on VSFilter but still render smoothly on libass.

This is how we arrive at the slogan that I usually use here: You may target only libass for performance, but you should target the intersection of libass and VSFilter for rendering. That is the only way to author subtitles that render correctly on libass and are guaranteed to continue to render correctly in the future.

For a list of known libass/VSFilter differences you should watch out for, refer to the corresponding page on the libass wiki and the list of ASS quirks.

Finally, I should mention that this story gets more complicated when additional third-party authoring tooling gets involved. Many authoring tools (think Aegisub, or the various Aegisub scripts) have their own quirks in parsing ASS subtitles (or often do not even try to match VSFilter's parsing, and instead apply their own (incorrect) mental model of what the format could be). To give a few examples:

  • At the time of writing, Aegisub's manual claims that \fad() is a simple fade that takes two arguments while \fade() is a complex fade that takes seven arguments. In reality, \fad and \fade are completely synonymous and can each be used with either 2 or 7 arguments. However, some automation scripts may not be aware of this, so even if it does not change rendering in any way, you may still want to use \fad for two-argument fades and \fade for seven-argument fades anyway.
  • Colors are often formatted as \1c&HFF0000&, but there is actually no need for either the leading &H or the trailing & as far as the renderers are concerned. However, Aegisub typically formats colors this way, so many automation scripts expect this format.

You can choose yourself how much you want to care about these quirks - since they will not affect the final rendering result this is only a matter of convenience for you as the author - but it is something to be aware of.
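For instance, a robustness-minded automation script might accept both color spellings. Here is a sketch of such a normalizer (parse_ass_color is a made-up helper name; recall that ASS stores colors in BGR order):

```python
def parse_ass_color(value: str) -> tuple[int, int, int]:
    """Parse an ASS color override value into (blue, green, red).

    The leading "&H" and the trailing "&" are accepted but not required,
    matching how the renderers themselves treat color values.
    """
    s = value.strip()
    if s.upper().startswith("&H"):
        s = s[2:]
    s = s.rstrip("&")
    n = int(s, 16)
    return (n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF

# Both spellings decode identically (0xFF0000 in BGR order is pure blue):
assert parse_ass_color("&HFF0000&") == parse_ass_color("FF0000") == (255, 0, 0)
```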

How Much to Rely on Internals

The rest of this article will explain part of how libass works internally (at the time of writing), since understanding this can be very helpful for understanding how to estimate or improve the rendering performance of your subtitle file.

Note that the concepts explained in the previous section also apply here: libass's only guarantee is that it will (try its best to) render subtitles the same way VSFilter does. It does not make any guarantees related to its performance for certain types of files, and especially not related to how any of its internals work. Understanding libass's internal workings can be helpful to get a general idea of which types of typesetting have which kinds of impacts on performance, but there is no guarantee that these internals will forever stay the way that they are right now. Of course, it is unlikely that libass will have any major performance regressions in common use cases, but it would not be impossible for libass to at some point become slightly slower in certain situations if that means becoming much faster or more accurate for other, more important cases. Still, the general gist of what kinds of operations are how expensive should probably stay the same in the long run.

Libass's Output Format

Let us start with something that is not even technically libass's internals, but actually its "externals," at least as far as the library is concerned. (Not that this makes any difference - the ultimately important part is that this is something that is not immediately visible to the Aegisub end user, but is still extremely helpful to understand. In fact, understanding this can also help with understanding the format's rendering output better.)

Let's say I am writing a video player and want to use libass to render subtitles: How exactly am I supposed to use libass?

It turns out that (given a subtitle file and a timestamp) libass does not simply take an RGB frame and return a copy with the subtitles drawn onto it. In fact, libass does not even need the video frames at all; all you need to tell it is the video's resolution and the resolution you want to draw subtitles at. Instead, given a subtitle file and a timestamp, libass will return a sequence of monochromatic bitmaps with an alpha channel. More precisely, it will return a sequence of "bitmaps," where each bitmap consists of the following:

  • A width and a height
  • X and Y coordinates specifying where on the screen the bitmap should be positioned
  • An RGBA color
  • width*height alpha values from 0 to 255, specifying the transparency of each pixel in the bitmap's width x height rectangle.

It is then the player's responsibility to blend these bitmaps onto the video one after the other, in the given order. (Players may also need to apply some colorspace conversions to the colors of the bitmaps, but that is not too relevant for our purposes.) Players are free to do this however they please: Aegisub will "naively" blend the bitmaps onto the RGB frame on the CPU while mpv will pack all the bitmaps onto a single large texture and send that to the GPU, so that blending can be done on the GPU as part of mpv's resizing and colorspace conversion pipeline.
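To make this concrete, here is a minimal sketch of what a player's "naive" CPU blending step amounts to. This is Python with a made-up SubBitmap type, not libass's actual C API; it only illustrates the per-pixel math:

```python
from dataclasses import dataclass

@dataclass
class SubBitmap:
    # Hypothetical stand-in for one bitmap returned by the renderer.
    x: int                        # screen position of the bitmap
    y: int
    w: int                        # bitmap dimensions
    h: int
    color: tuple[int, int, int]   # the bitmap's single RGB color
    alpha: int                    # overall opacity, 0..255 (255 = opaque)
    mask: list[int]               # w*h per-pixel alpha values, 0..255

def blend(frame: list[list[tuple[int, int, int]]], bmp: SubBitmap) -> None:
    """Alpha-blend one monochromatic bitmap onto an RGB frame, in place."""
    for j in range(bmp.h):
        for i in range(bmp.w):
            a = bmp.mask[j * bmp.w + i] * bmp.alpha // 255
            if a == 0:
                continue
            p = frame[bmp.y + j][bmp.x + i]
            frame[bmp.y + j][bmp.x + i] = tuple(
                (c * a + bg * (255 - a)) // 255 for c, bg in zip(bmp.color, p)
            )
```

The player simply calls blend() once per bitmap, in the order libass returns them; that ordering is what produces the layering behavior discussed in this section.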

This output format has some very important consequences about the limitations of ASS rendering!3 Let me repeat the slogan again to really hammer this in: Any ASS subtitle file will result in a sequence of monochromatic alpha bitmaps. Any one of these bitmaps can only have a single color. It may have different levels of transparency at different points of its rectangle (the most common case being that the bitmap is fully opaque in some areas and fully transparent in other areas, with some transition in between those two areas), but its hue, saturation, and brightness cannot change throughout its rectangle.

How can a single subtitle line have multiple different colors, then? Well, one single subtitle line - and, in fact, one single shape run, i.e. one section of text with no tags or line breaks in between or one drawing - can result in more than one bitmap. More specifically, one shape run can result in up to four bitmaps4:

  • One bitmap for the shape's fill
  • One bitmap for the shape's outline, obtained by thickening the shape, rasterizing that, and possibly subtracting the fill from it5
  • One bitmap for the shape's shadow, obtained by shifting the outline bitmap by a certain distance
  • One bitmap for karaoke highlighting (using \k, \ko, \kf, etc.), obtained from the shape's fill or outline, possibly cut off horizontally at some position

You may recognize these as exactly the four parts of a line that you can specify colors for via \1c, \3c, \4c, and \2c! And this is in fact exactly what these tags do in libass: They just control the color that is associated to the corresponding bitmap returned by libass.

So, what does this mean for the ASS format's capabilities and limitations? It means that any shape run can have at most four colors! And since, when doing advanced effects, these four bitmaps rarely interact with each other in the way you want, you can often only use one color per shape run.

So what do you do when you want more colors? Well, you need to make more shape runs. Either by splitting a single line into multiple shape runs with different colors (a gradient by character), or by creating many copies of your line with different colors and different positions/clips/etc (e.g. a clip gradient). This is nothing new to a typesetter, but understanding the "monochromatic bitmaps" limitation can help one understand why these techniques have to work the way they do.

The fact that the bitmaps returned by libass are blended onto the video frame in sequence (as opposed to first being "added together" in some other way) also has some important consequences: The only way to fully cover the original video's content is with a fully opaque bitmap. One may be tempted to think that if two rectangles with 80% opacity are stacked on top of one another, they will completely mask the background since the two 80% opacity values add up to more than 100%, but this is not the case: The bitmaps are blended onto the video one after the other, so after blending the first bitmap, the background will still be visible with 20% opacity, so after blending the second bitmap it will still be visible with 4% opacity. This can be very important when dealing with blurred edges and/or antialiasing. For example, two opaque rectangles joined at an edge will completely cover their area, but when those two rectangles are blurred individually, the background will bleed through at the edge they share. The same will happen when the edge is not completely aligned to the pixel grid in display units, since then the edges will be slightly blurred due to antialiasing.
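The 80%-opacity example can be checked with two lines of arithmetic (a sketch; residual_background is just an illustrative name):

```python
def residual_background(opacities: list[float]) -> float:
    """Fraction of the background still visible after blending, in order,
    a stack of bitmaps with the given opacities (0.0..1.0)."""
    visible = 1.0
    for a in opacities:
        visible *= 1.0 - a   # each blend keeps (1 - alpha) of what's below
    return visible

# Two 80%-opaque rectangles still let 4% of the background through:
assert abs(residual_background([0.8, 0.8]) - 0.04) < 1e-9
```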

Finally, understanding how shapes render as four monochromatic bitmaps can also help you to understand other typesetting techniques like simple blur & glow setups: \blur is implemented by applying a gaussian blur to an individual bitmap's alpha channel. It does not change a bitmap's color or interact with any other bitmaps. Moreover, \blur is only ever applied to one bitmap of a shape at a time (but may then also be applied to the shadow or karaoke color as a consequence, since these are copied from the fill or outline): If the line has no outline, the line's fill (i.e. the fill bitmap's alpha channel) will be blurred. If the line has an outline, the outline is blurred but the fill is not. However, the line's outline bitmap is below the fill in rendering order, so blurring the outline will only blur the outer side of the outline, and not the transition from the outline to the fill.

So, if we want the transition from outline to fill to also be blurred, we have to create multiple copies of our line, make sure that one of them has a blurred fill and the other has a blurred outline, and then layer them correctly. This is exactly what the typical "Blur & Glow" workflow does.
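As a concrete sketch, one common arrangement (all positions, times, and values here are illustrative, not canonical) uses two copies of the line on different layers: the lower layer supplies the blurred outline "glow" with its fill hidden, and the upper layer supplies the fill:

```
Dialogue: 0,0:00:00.00,0:00:05.00,Default,,0,0,0,,{\pos(320,180)\bord4\blur3\1a&HFF&}Glowing text
Dialogue: 1,0:00:00.00,0:00:05.00,Default,,0,0,0,,{\pos(320,180)\bord0\blur0.6}Glowing text
```

The layer-1 copy sits on top, so its softly blurred fill covers the hard inner edge of the blurred outline below it.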

Similarly, we can understand the "shad trick." Here, our setting is that we would like to thicken a shape, and then either strongly blur that thickened version or make it semi-transparent (or both). We know that we can get a thickened version of our shape through the outline, so we apply a large \bord to our shape. However, this does not quite work: If we now apply a \blur to our line, it will only blur the outline and not the fill, so the fill will always be fully opaque on top of the blurred outline. This is not a problem when we want a small blur, but if the blur radius6 is larger than the thickening distance, the blur will reach inside the shape's fill and the opaque fill on top of the blurred outline will create a hard edge.

The core problem here is that the outline and fill are two separate bitmaps. We can try to hide the fill using \1a&HFF, but this will cause the fill bitmap (before it is made transparent) to be subtracted from the outline bitmap, so that the outline will be hollow with a hard edge on the inside. Luckily, the shadow comes to our rescue. When our line has a nonzero \bord (which ours does), the base for the shadow bitmap will be a copy of the outline. However, if we add a \ko0 to our line (or give the fill a \1a&HFE, making it almost but not fully transparent - this was the old method before the \ko0 method was discovered), the fill is not subtracted from the shadow like it is from the outline. Hence, if we then force a shadow to be created with \shad0.001 and remember to hide all other bitmaps with \1a&HFF\2a&HFF\3a&HFF, we get a beautiful single thickened bitmap. This is exactly what the "shad trick" is.7
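Putting the pieces together, a shad trick line might look something like this (all values illustrative):

```
Dialogue: 0,0:00:00.00,0:00:05.00,Default,,0,0,0,,{\pos(320,180)\bord12\blur18\shad0.001\ko0\1a&HFF&\2a&HFF&\3a&HFF&\4c&HFFFFFF&\4a&H00&\p1}m 0 0 l 300 0 300 150 0 150
```

Here \bord12 supplies the thickening, \ko0 prevents the fill from being subtracted from the shadow, \shad0.001 forces a shadow bitmap to exist, the three alpha tags hide the fill, karaoke, and outline bitmaps, and \4c/\4a color the single remaining bitmap.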

Note, however, that the shad trick has a performance cost: Even though the fill and outline will be fully transparent, they will still be returned by libass and need to be blended by the player (at least at the time of writing). Hence, a shad trick line will be more expensive to render than a normal line with only a fill (three times as many bitmaps to blend, twice as many bitmaps to rasterize), so it should only be used when necessary.

The Single Biggest Performance Factor

We are now ready to talk about the single most important factor when it comes to the rendering performance of subtitles:

No matter how libass produces the bitmaps it outputs, the player needs to blend them onto the video in some way or another. The blending calculations themselves are not hard - CPUs can do math extremely quickly. The main "difficulty" (in the sense of performance) turns out to come from the sheer amount of data that has to be processed here, i.e. the amount of memory that needs to be accessed or copied in order to blend the bitmaps. This is a very common theme in image processing: Very often the main bottleneck is not CPU speed, but memory bandwidth (and, as a corollary, factors like cache size and locality).

Either way, we can come to the following conclusion: The dominating factor in ASS rendering performance is total bitmap size. That is, the sizes (width times height) of all the bitmaps output by libass, summed together. Certain caveats apply here, in particular involving caching (see below), but this slogan can already explain the vast majority of performance guidelines.

Let's give some basic examples:

  • A single letter or small-ish drawing, moving across the screen over time and rapidly changing color? No matter how complex the movement and \t transforms are8, the resulting bitmap will be fairly small so this will render very efficiently.

  • An opaque rectangle covering the entire screen? This may be a very short subtitle line, but it will generate a very big bitmap and hence have a fairly big impact on performance.

  • A gradient by character? Here, every character will have a different color and hence create its own shape run. This will mean that there will be a lot of bitmaps (one for each character's fill/outline/etc.). However, each of these bitmaps only contains a single character, so the individual bitmaps will be fairly small. As the slogan says, what matters is the total size, not the count of the bitmaps. All of the per-character bitmaps combined will be just as large as a single bitmap for the entire line would be (in fact, it can also be much smaller when rotations are involved, see below), so a gradient by character will not render any slower than a single-color line.

    I say "just as large" here because the sizes may not match exactly: On one hand, splitting the line into characters allows spaces to be skipped, which results in some bitmap size savings. On the other hand, when the text has large outlines there may be some overlap between the individual per-character bitmaps, resulting in a slight increase in total bitmap size. But both of these are usually mostly negligible, so in summary we can say that splitting a line into individual characters has no big impact on performance.

  • A horizontal or vertical clip gradient? The situation is very similar here, at least when it comes to total bitmap size: There will be a lot of bitmaps, but each bitmap will just be a single-pixel strip. So, once again, the total bitmap size will be no different from the size of a single line with no gradient. We will see later on that there are some other factors here that play into why clip gradients are so efficient, but we can already see that the total bitmap size works out.

  • A diagonal clip gradient? This one is more of a problem. Once again, there will be many bitmaps for the many small strips of the shape, but this time the individual bitmaps will not be a single pixel wide or tall! Bitmaps are always rectangles, so a bitmap containing a narrow diagonal strip will need a comparatively large bitmap to contain it. As a result, a diagonal clip gradient will result in a large number of large bitmaps, and can hence have quite a high performance impact!
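The horizontal-vs-diagonal difference can be put into numbers with a rough back-of-the-envelope model (a sketch with made-up strip geometry, not libass's actual accounting):

```python
import math

def horizontal_strips_area(w: int, h: int) -> int:
    """Total bitmap area of a W x H region split into 1-pixel-tall strips."""
    return h * (w * 1)

def diagonal_strips_area(w: int, h: int, angle_deg: float) -> int:
    """Total bounding-box area when the strips are tilted by angle_deg.

    A 1-pixel-wide strip spanning the full height still needs an
    axis-aligned bounding box of roughly (h * tan(angle)) x h pixels.
    """
    t = math.tan(math.radians(angle_deg))
    strips = w + int(h * t)              # strips needed to sweep the region
    return strips * max(1, int(h * t)) * h

plain = 1280 * 720                        # one bitmap covering the region
assert horizontal_strips_area(1280, 720) == plain
assert diagonal_strips_area(1280, 720, 45) > 100 * plain
```

Even though every tilted strip holds only a one-pixel-wide sliver of content, the rectangular bounding boxes of those strips overlap massively, which is exactly why diagonal gradients are so much heavier than axis-aligned ones.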

So, we see that understanding this total bitmap size rule can already tell you a lot about the performance of different types of subtitle effects. If you remember one thing from this article, it should be this rule. Still, there is more to learn about this topic, as we will see in the following sections.

Libass's Rendering Pipeline

Let me now explain in more detail how libass renders a subtitle file. Roughly, it performs the following steps9:

  • Given a subtitle file and a timestamp, find all events (i.e. subtitle lines) which are visible at the given timestamp.
  • For every such event:
    • Parse the event's text. This goes through the event character by character, parsing override tags as they appear. When tags like \t, \move, or \fad are encountered, their respective values at the current timestamp are computed. The result of this step is a list of text characters or drawings, together with a full list of style parameter values (including the font name, font size, color, blur, outline/shadow, scale, rotation, shear, etc.) for each individual character or drawing.
    • Split the event's text into runs, where each run is a contiguous sequence of characters (or a single drawing) that have the same style parameters. Each drawing gets its own run. An empty tag block {} will not split a run of characters, but it will split a drawing into two drawings and hence two runs.
    • For each such run:
      • Turn each individual character or drawing into a shape. For characters, this entails looking up the font and reading the corresponding glyphs in the fonts.10 For drawings, it entails parsing the drawing text and turning it into libass's own internal binary format for a shape.
      • Transform the shapes according to their run's style parameters. This means applying position shifts, scaling, rotation, and shearing. These transformations happen on the level of shapes, that is, they transform a shape's end points and spline control points (as opposed to being somehow applied in raster space).
      • The shapes obtained from the previous step will be used for the fill of the run's characters or drawings. If the run has an outline set, libass will now take the fill shape and expand it by the outline width to obtain a thickened shape that will be used for the outline. Note that libass behaves differently from VSFilter here! VSFilter draws outlines in raster space, not in vector space. This can cause some rendering differences, in particular for zero-width contours, which will have a nontrivial outline with libass but not with VSFilter. As explained in the previous sections, this means that you should not rely on libass's stroking behavior for zero-width contours.
      • Rasterize the shapes obtained from the previous step to bitmaps.
      • Combine all of the run's fill and outline bitmaps into a single bitmap each.
      • If the run has a nonzero blur, blur the outline bitmap if there is one, otherwise blur the fill bitmap.
      • If necessary, subtract the fill bitmap from the outline bitmap. If the line has a non-integer shadow (in output units, not necessarily in PlayRes units), shift the outline or fill bitmap by the necessary amount to obtain the shadow bitmap. (Shifting the shadow by an integer amount can be done by just adjusting the position fields, without modifying the actual pixel data)
    • We have now obtained up to three combined bitmaps (fill / outline / shadow) for each run. Next, apply rectangular \clip or \iclip as well as karaoke effects if present by cropping the bitmaps (or a copy of the bitmaps in the case of karaoke effects). This is actually very cheap, since it can be done by just modifying the coordinates, sizes, and pointers of the bitmaps, without modifying (or even copying) the underlying pixel data. In the case of a rectangular \iclip, this can be done by combining four rectangular cropped copies of the same bitmap, covering the four sides of the rectangle that is cut out of the bitmap. Here, making a "copy" of the bitmap (both in the case of \iclip and for karaoke effects) can be done by simply creating another bitmap object that refers to the same underlying pixel data buffer, so this is also very fast and does not need to make a copy of the pixel data.
    • If the line has a vector clip, rasterize the vector clip to a bitmap and multiply each bitmap's output layer with that rasterized bitmap (or its inverse if it's an \iclip).
  • Having now obtained a set of bitmaps for each event, check if there are any events on the same layer that collide with one another, and move some of them out of the way if so.
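As an illustration of one of the cheaper steps, the run-splitting stage above can be sketched in a few lines. This is a simplification: real style parameters are full tag states, and drawings (which always get their own run) are not modeled here:

```python
from itertools import groupby

def split_runs(chars: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Group a list of (character, style) pairs into (style, text) runs.

    Contiguous characters with identical style form one run, mirroring
    how a renderer would split an event's text.
    """
    return [
        (style, "".join(c for c, _ in group))
        for style, group in groupby(chars, key=lambda cs: cs[1])
    ]

styled = [("H", "red"), ("i", "red"), ("!", "blue")]
assert split_runs(styled) == [("red", "Hi"), ("blue", "!")]
```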

Consequences for Performance

After understanding libass's rendering pipeline, we can now talk about which of these different steps affect performance the most (and which don't).

As a general rule of thumb, the rule about total bitmap size also holds here: The only steps that are really relevant for performance are the ones that somehow deal with bitmap data (that is, the ones that deal with the actual pixel data and do not just shuffle bitmap metadata like (the integer parts of) positions or colors around), and their performance is roughly proportional to the total size of the bitmaps they need to handle. Specifically, this means the following:

  • Parsing tags is not very expensive in the grand scheme of things, even when there are many tags or very many events. Neither are "animation" effects like moves, transforms, fades, etc. ASS rendering always happens on a specific timestamp, so a transform can just be replaced by a constant value (at that timestamp) when parsing. The fact that a subtitle changes from frame to frame does not make the subtitles any slower to render - the subtitles need to be rendered and blended every frame anyway, whether they change or not.11

    Parsing drawings can have a performance impact when there are very many different drawings with very long text. Most of the time the bitmap sizes are the bigger problem, though.

  • Similarly, splitting style runs, looking up fonts and glyphs in fonts, and transforming shapes is comparatively cheap and not worth worrying about.

  • The first computationally expensive step is rasterizing shapes. The shape's complexity (i.e. how many vertices there are per area) does affect the speed there to some degree, but the main factor is just the resulting bitmap's size, which depends on the shape's bounding box. This bounding box is obtained from the font's metrics if the shape comes from a font glyph, or computed as the maxima and minima of all vertex coordinates when the shape comes from a drawing.

  • Combining multiple bitmaps into a single one simply scales with total bitmap size.

  • Blurring bitmaps - you guessed it - also scales with the size of the blurred bitmap. However, there is a bit more to say here:

    • Out of the various steps scaling with total bitmap size, blurring is generally the most expensive one, if it is performed. Often there is little you can do to avoid blurring, but if you have a rectangle covering the entire screen with a \blur1 then you might want to ask yourself if you really need that blur or if you can achieve your effect without it.

    • If \blur is present, the cost of a blur will be approximately constant with respect to the blur strength. That is, a very strong blur will not be any more expensive than a weak blur in principle.12 However, a stronger blur will need to pad the bitmap by a larger amount so that the blur does not cut into the bitmap's edge. This in turn makes the bitmap larger, which makes the blur slightly slower. This should only be an issue on very strong blurs relative to the bitmap sizes, though.

      How strong is "very strong" will depend on the size of the blurred bitmap and on how many such bitmaps there are. Empirically, a blur with a strength of S will increase the width and height of a bitmap by about 7 times S pixels (assuming the display resolution is equal to the LayoutRes; otherwise everything will be scaled by the ratio of the two). Expanding a single 16x16 bitmap to 32x32 with a \blur2.5 is not a problem for a single bitmap, but it is still a growth in area by a factor of 4 and can become noticeable when there are many such bitmaps. Expanding a single 16x16 bitmap to 796x796 with a \blur10013 will have a big impact, but will probably not be necessary in practice.

    • \be, on the other hand, gets more and more expensive the stronger its strength is. It also does not scale correctly and has issues with its padding, so you should never use it anyway.

  • Subtracting the fill from the outline once again scales with total bitmap size. The same holds for shifting the shadow's bitmap, but here it is important to mention that this shift will only be performed if there is actually a nontrivial distance to shift by, after rounding to 1/64th of an output pixel. Hence, an \xshad0.001 will be faster than an \xshad0.1, which will in turn be faster than a \shad0.1 (since there the shift has to be performed in both directions). If you find yourself needing shadows, check if you can get away with rounding all of your shadow offsets to integers. (But note that transformations like scale, rotation, and shearing will interfere with this, if present, and that this optimization will only work when the display resolution is a multiple of the PlayRes.)

  • Rectangular \clip and \iclip are actually almost free for the reasons explained above. Since \clip reduces the resulting bitmap's size, it can even make the rendering (or, to be more precise, the blending) cheaper. Similarly, this step of creating a new bitmap structure for the karaoke effect is almost free too, but of course it still creates another bitmap that must then be blended by the player.

  • Finally, there is the vector clip step. (As well as the collision detection, but that is basically free too.) Here, the vector clip has to be rasterized to a bitmap and applied to the event's bitmaps. Hence, this scales with both the total bitmap size and the size of the vector clip shape - whichever is larger.

The last two points are quite important! A rectangular clip is basically free, while a vector clip can be very expensive. This means that

  • You should never use a rectangular vector clip. If your clip can be a rectangular clip, it should be one.
  • You should not make unnecessarily large vector clips. Keep them as small as possible. At the very least, clip them to the frame's boundaries.
  • At least in a vacuum, it is more efficient to "bake in" a vector clip by intersecting it with your line's shape and using the resulting shape as a drawing (with no vector clip). This will usually get you better visual results anyway, since vector clip edges cannot be blurred while drawing edges can. But I say "in a vacuum" here since this can break caching when multiple events are involved, see below. In such cases, which option is faster will depend a lot on the specific situation and can probably only be determined with benchmarks, but either way, baking in vector clips should be an option to consider.

Finally, one additional corollary of the above pipeline is that, at least at the time of writing, clips are only considered very late in the pipeline. Events with large drawings or a lot of glyphs will still get fully rasterized, combined, blurred, and shifted, even if they have a very small rectangular clip that would hide the majority of the event (or even cut off some shapes or runs entirely). The same holds for the frame borders: Shapes are still fully rasterized, even if they are outside of the video frame. However, this could very well change in the future, especially for the frame borders.
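Two of the numeric rules above - blur padding and shadow-offset rounding - can be sketched in a few lines of Python. This is a back-of-the-envelope model, not libass code: the 7·S padding figure is the empirical estimate quoted earlier, and 1/64 is the shadow-shift rounding unit.

```python
def padded_dims(w, h, blur_strength):
    # Empirically, \blur of strength S pads width and height by roughly
    # 7*S output pixels in total (rough figure; see the discussion above).
    pad = round(7 * blur_strength)
    return w + pad, h + pad

def shadow_shift_needed(offset):
    # The shadow bitmap is only shifted if the offset is still nonzero
    # after rounding to 1/64th of an output pixel.
    return round(offset * 64) != 0

print(padded_dims(16, 16, 2.5))    # roughly the 16x16 -> ~32x32 example
print(shadow_shift_needed(0.001))  # False: rounds to 0, shift is skipped
print(shadow_shift_needed(0.1))    # True: a real (slower) subpixel shift
```

The takeaway matches the rules of thumb: padding only matters when the blur is large relative to the bitmap, and a shadow offset below 1/128 of a pixel is effectively free.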

Caching

If you read that last paragraph carefully (and have been paying attention), you might be confused now: Why exactly is a (horizontal or vertical) clip gradient so fast? Sure, the total bitmap size that libass outputs is the same as that of a normal line with no gradient, but that is only after applying the rectangular clips. Before applying the clips, there is a very large number of lines, each of which rasterizes to a bitmap (or sequence of bitmaps) that itself is the size of the entire line. That results in a massive total bitmap size for the rasterization step, right?

This is exactly where caching comes in. Libass will cache the results of most of its expensive operations. These include (at the time of writing):

  • Looking up a font from a font name (and bold/italic/vertical values)

  • Obtaining a shape from a font's glyph, as well as the glyph's metrics

  • Parsing a drawing into a shape

  • Thickening a shape

  • Rasterizing a shape to a bitmap

  • Creating a "composite" bitmap from all the per-shape bitmaps in a run by (if applicable):

    • combining all the per-shape bitmaps of a run into a single bitmap for fill and outline each,
    • applying blur,
    • subtracting the fill from the outline, and
    • shifting the shadow bitmap.

    Note that this cache is for the combined operation of applying these four steps, not for each of the steps individually.

When libass is about to perform one of these operations, it will first check if it already has the output for the current input in its cache, and use that if so. It can do this without copying the pixel data: It can just work with a reference to the same buffer.

With this, we can now fully understand how a clip gradient has the performance that it does: A clip gradient consists of a lot of copies of the same line, just with different colors and different one-pixel-wide rectangular clips. Neither the color nor the rectangular clips affect the rasterization, so after rendering the first such copy, all later copies of the same line can skip the entire rasterizing and composing pipeline and just get the result from the cache. The result will be that all of these copies of the same line refer to the same composite bitmap, just with different crops and colors. As a result, we only have to rasterize once, resulting in a total rasterized bitmap size as large as our original line, and while there will be very many output bitmaps, each of them is very small so the total output bitmap size is also only as large as our original line. This is what makes clip gradients almost as fast as a single-colored line.
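This cache behavior can be sketched with a toy model. The cache key below is a simplification (the real key contains many more fields), but the crucial property is the same: neither the color nor a rectangular clip is part of it.

```python
raster_cache = {}
rasterizations = 0  # counts how often we actually produce pixel data

def rasterize(shape, subpixel_pos):
    # Toy model of the rasterization cache: the key covers everything that
    # affects pixel data -- but not colors and not rectangular clips.
    global rasterizations
    key = (shape, subpixel_pos)
    if key not in raster_cache:
        rasterizations += 1
        raster_cache[key] = object()  # stands in for the pixel buffer
    return raster_cache[key]

# A clip gradient: many copies of one line, differing only in color and in
# a one-pixel-wide rectangular \clip -- neither enters the cache key.
for x in range(500):
    color = x                   # ignored by the cache
    clip = (x, 0, x + 1, 1080)  # ignored by the cache
    rasterize("same shape", subpixel_pos=(0, 0))

print(rasterizations)  # 1: all 499 later copies are cache hits
```

Five hundred gradient slices, one rasterization: this is exactly why the total *rasterized* bitmap size of a clip gradient is only as large as a single copy of the line.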

So, we can now make our golden rule of total bitmap size more precise: Without any caching, the cost of rendering a frame of subtitles is mostly proportional to the total bitmap size before applying rectangular clips, plus the total bitmap size of all vector clips. With perfect caching, the cost can be lower, but it will never be less than (proportional to) the total bitmap size output by libass, i.e. after applying rectangular clips. In reality, the cost will be somewhere in between these two values, depending on how many things can be cached.

Once again it is hard to give concrete advice here outside of very specific context, but some general consequences of this are the following:

  • "Simplifying" events, in particular drawings, by removing invisible contours can actually be harmful when applied per-event in a context where there are many copies of the same event. We have already seen that a drawing's "complexity" does not contribute that much to its performance cost, so purely "simplifying" a drawing will not improve its performance by much.

    On the contrary, simplifying each drawing individually can break caching: Suddenly, there are many slightly different versions of the same shape, rather than a lot of copies of a single shape that can be cached. In particular, ASSWipe (an Aegisub automation script that, among other things, will "purge invisible contours") can be counterproductive in some cases.

    Simplifying events can be a great performance improvement when it cuts down - you guessed it - total bitmap size, for example by culling parts that are off-screen, outside of a clip, or behind some other opaque shape. Of course, this may still negatively affect clipping, but how exactly these two factors weigh up against each other can only be determined by benchmarks on a case-by-case basis.

  • Similarly, baking a vector clip into a drawing will save the cost of blending the vector clip, but can hurt caching in certain cases (like when applied to many copies of the same line with e.g. different transformations or different vector clips). Once again, which option is better will depend on the specific situation.

  • In general, when you can ensure that multiple variants of the same event (in a performance-critical setting) use the exact same shape (glyph or drawing) and positioning without making any big changes to your subtitle file, it may be a good idea to do so.

  • One very important caveat here is that an event's position can affect its rasterization (as can other transformations like scaling, rotation, and shearing, but that may be less surprising), and hence break the caching of its rasterization (and hence the composite caching too). Specifically, the fractional part of a shape's X and Y position (in output units), rounded to multiples of 1/8, affects the shape's rasterization. That is, if two lines are identical except that one of them has a \pos of (100,200) and the other a \pos of (300,400), then (assuming that the rendering resolution is equal to the PlayRes) they will rasterize to the same bitmap, and hence one line can use the cached rasterization of the other. In fact, the same will happen if one line has a \pos of (300.05,400), since 300.05 rounds to 300 when rounded to multiples of 1/8. However, a line with a \pos of (300.5,400) or even (300.1,400) will rasterize to a different bitmap and hence not be able to use the other line's cached rasterization. Hence, rounding a line's position to integer values can be beneficial for caching, at least when the rendering resolution is a multiple of the PlayRes. (And, as a converse, differently positioned copies of the same shape may not always be able to use caching in the way that you expect.)
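The 1/8-pixel rounding rule is easy to check numerically. Here, subpixel_key is a made-up helper that mirrors the rule, not a libass function:

```python
def subpixel_key(coord):
    # Only the fractional part of a position, rounded to multiples of 1/8
    # of an output pixel, influences rasterization (and the cache key).
    return round(coord * 8) % 8

print(subpixel_key(100.0) == subpixel_key(300.0))   # True: integers match
print(subpixel_key(300.05) == subpixel_key(300.0))  # True: rounds to .0
print(subpixel_key(300.1) == subpixel_key(300.0))   # False: rounds to .125
print(subpixel_key(300.5) == subpixel_key(300.0))   # False
```

So (300,400), (100,200), and (300.05,400) can all share one cached rasterization, while (300.1,400) and (300.5,400) each force a fresh one.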

In summary, caching can cause massive performance savings in specific cases like clip gradients, but can also easily break. I would recommend not worrying too much about making sure to make use of caching in a situation that does not strictly need it. What is more important is being aware of the situations that do strictly need caching (which are mainly clip gradients) and making sure not to break those for any of the reasons listed above.

Note, also, that all of these points only apply when there are multiple copies of the same shape (on the same frame) involved. If a specific shape only appears once (per frame), you don't need to worry about breaking caching in the first place.

At the time of writing, libass's caches are preserved throughout when rendering a single frame, but may be cleaned up between frames when they get too large. While libass's caching behavior for a single frame is somewhat predictable, I would not recommend relying on libass's caching across frames. In fact, I do not recommend relying on any kind of behavior across frames at all when it comes to performance. Instead, ensure that each individual frame renders fast enough, even if it is the first frame that libass will ever render. If you are checking performance in mpv, you can resize mpv's window (e.g. by exiting and reentering fullscreen) to clear all of libass's bitmap caches.

Note that what is not currently cached is the process of applying a vector clip (the rasterization is cached, but the multiplication is not). This could change in the future, though.

Bitmaps are Rectangles

In the previous sections we have learned the following two facts:

  • Performance mainly depends on total bitmap size
  • The fill and outline of each "run" of a subtitle event is rasterized to a single bitmap (from which the shadow and karaoke effect, if present, are then derived).

Recall that a "run" was a contiguous sequence of characters that have the same style parameters, or a single drawing.

A bitmap is always rectangular, so each bitmap has to be chosen large enough to fit its entire run into its rectangle. For your run-of-the-mill (heh) horizontal (or even vertical) text, this is not an issue, but it can be a big problem once rotations or drawings get involved!

For example, take a line like {\frz45}A very very very very very very very long diagonal line. Since there are no changes in override tags values in the middle of this line, this line will split into a single run. The bitmap containing this run will be a fairly large square, but most of its space will be "wasted"! That is, the majority of the bitmap will be transparent, and only the diagonal will have opaque pixels. Of course, the blurring and blending code will not know this and happily blur and blend all the fully transparent pixels. This makes this line much more expensive than it needs to be.
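To get a feel for the numbers, assume the un-rotated line is a 20-character run roughly 1000x50 PlayRes pixels in extent (illustrative figures, not taken from a real font):

```python
import math

def rotated_bbox_area(w, h, degrees):
    # Area of the axis-aligned bounding box of a w-by-h rectangle after
    # rotating it by the given angle -- i.e. the bitmap a run would need.
    a = math.radians(degrees)
    bw = abs(w * math.cos(a)) + abs(h * math.sin(a))
    bh = abs(w * math.sin(a)) + abs(h * math.cos(a))
    return bw * bh

single_run = rotated_bbox_area(1000, 50, 45)   # one bitmap for the line
per_char = 20 * rotated_bbox_area(50, 50, 45)  # one bitmap per character
print(single_run / per_char)                   # ~5.5x more bitmap area
```

Even in this modest example, the single-run bitmap is over five times larger than the sum of the per-character bitmaps, and the gap grows with the length of the line.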

This can be fixed by forcing the line to be split into multiple runs by changing some tag values in between each character. A good tag to use for this is \2a, since you rarely actually need to set it. The result could look like {\frz45}A{\2a1} {\2a0}v{\2a1}e{\2a0}r... (or use \2a&H00& and \2a&H01& if you want to appease broken third-party tools). This way, every individual character will receive its own, much smaller rectangle around it, and the resulting total bitmap size will be much smaller.
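A script that applies this trick mechanically could look like the following sketch. split_runs is a hypothetical helper, and the alternating \2a&H01&/\2a&H00& values correspond to the "appease broken third-party tools" variant above:

```python
def split_runs(text, tags=("\\2a&H01&", "\\2a&H00&")):
    # Insert an alternating, visually irrelevant \2a override between every
    # pair of characters so each character becomes its own run (and bitmap).
    out = []
    for i, ch in enumerate(text):
        if i > 0:
            out.append("{" + tags[(i - 1) % 2] + "}")
        out.append(ch)
    return "".join(out)

print(split_runs("Abc"))  # A{\2a&H01&}b{\2a&H00&}c
```

The tags must alternate: inserting the *same* override twice in a row would not change any style parameter at the boundary, and the characters could end up in one run after all.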

Similarly, consider a drawing like {\an7\p1}m 0 0 l 100 0 100 100 0 100 m 1000 1000 l 1100 1000 1100 1100 1000 1100. This will draw two 100x100 squares that are 1000 pixels (in PlayRes units) apart from one another. Since this is a single drawing, it will hence result in a single 1100x1100 bitmap, even though only two 100x100 squares are actually used. Again, this will be much more expensive to render than it needs to be. Like before, this can be fixed by splitting the two squares into separate runs, and hence separate bitmaps. You can do this by inserting a {} in the middle of the drawing (note that this only works for drawings, not for text!), but this will then shift the second run by the width and height of the first one. You can compensate for this by shifting the second component in the opposite direction (arriving at {\an7\p1}m 0 0 l 100 0 100 100 0 100{}m 900 900 l 1000 900 1000 1000 900 1000), but the much simpler method is to just split the drawing into two separate events. Third-party tooling will be able to deal with that better, too.

To give one more example, let's suppose we want to draw a border around the entire frame. (That is, for example, draw a 10-pixel-wide strip at each edge of the frame.) Even if the resulting drawing would be connected, using a single drawing for this would once again waste a huge amount of bitmap space, since the resulting bitmap would be as large as the entire frame. Instead, it would be much more performant to split our border into four different drawings, one for each side.
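The bitmap-size arithmetic for that border example (10-pixel-wide strips on a 1920x1080 frame) works out as follows:

```python
frame_w, frame_h, t = 1920, 1080, 10

# One connected drawing: the bounding box is the whole frame.
single_drawing = frame_w * frame_h

# Four separate drawings: two full-width strips (top and bottom)
# plus two shorter strips for the left and right edges.
four_strips = 2 * (frame_w * t) + 2 * ((frame_h - 2 * t) * t)

print(single_drawing, four_strips)  # 2073600 vs 59600: ~35x less bitmap data
```

The four-strip version touches roughly 35 times fewer pixels in every rasterize/blur/blend step, for the exact same visual result.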

However, things aren't always as simple. Remember that bitmaps are blended onto the image one after the other. Hence, if transparency is involved14, blending a shape with two components as two separate bitmaps will create a different visual result than blending it as a single bitmap if the components overlap. (And even if the shapes themselves do not overlap, they might overlap after being given an outline and/or being blurred.) For example, consider a line like {\an5\fs200\bord30\shad0\3a&H80&}AA. Here, the outlines of the two characters are a single transparent bitmap, so they have a constant opacity throughout. However, if we replace the line with {\an5\fs200\bord30\shad0\3a&H80&}A{\2a1}A, suddenly the overlap between the two outlines becomes much darker15! This is because the two outlines are now two separate transparent bitmaps, so blending them one after the other makes their overlap darker than the rest.

The upside of this is that this gives us a very easy way to check if a certain line has a run break or not - or more generally which shapes belong to separate bitmaps and which belong to the same one. Just make everything half-transparent and, if needed, give it a huge \bord so that the outlines overlap, and check if the intersections become more opaque or not. (Just make sure that changing the transparency and/or the \bord does not add or remove additional run breaks.)

The downside, however, is that this can cause very undesirable rendering in situations where you want to, or have to, use separate bitmaps. For example, forcing a run break at every character in our example of a very long line with {\frz45} worked well to reduce total bitmap size, but if our line has a large semi-transparent outline and/or a large and blurred outline, splitting the line into separate runs will cause the overlaps between outlines to look bad. (This can also happen if the line has no outline and a strongly blurred fill, but it's usually much less noticeable there.)

This can be worked around in some situations by using rectangular clips16 to ensure that only one (output) bitmap covers every point in the line. This can be done by making a bunch of duplicates of your event (remember, this will then benefit from caching), and giving each duplicate a small rectangular clip such that

a) All of the clips are disjoint.

b) All of the clips together cover the entire rendered output.

c) Whenever two shapes overlap, there are two clips, each fully containing one of the shapes but not fully containing the other.

Then, you can selectively hide individual runs in individual events with {\alpha&HFF&} (remember to restore all alpha channels to their previous values after the runs) so that every point in the visible output is covered by exactly one visible run.

This works quite similarly to a clip gradient, the only difference being that it selectively toggles visibility instead of making a linear color gradient. In particular, it can benefit from the composition cache just like a clip gradient will.

A toy example for this is the following pair of lines (on a 1920x1080 PlayRes):

  • {\pos(960,540)\an5\fnArial\fs200\bord30\shad0\3a&H80&\clip(800,400,960,700)}A{\2c1\alpha&HFF&}A
  • {\pos(960,540)\an5\fnArial\fs200\bord30\shad0\3a&H80&\clip(960,400,1200,700)}{\alpha&HFF&}A{\2c1\alpha\3a&H80&}A

However, this is very finicky and no good tooling for this exists at the moment.

Mythbusting

This finally concludes most things that can and should be said about rendering performance. In these last two sections, I will summarize everything I discussed above with some concrete advice, and bring up a few miscellaneous points that haven't been mentioned in detail yet.

Let us start with a "mythbusting" section: I have talked at length about which parameters affect performance the most, which in turn means that most of the other parameters do not significantly affect performance. There exist many popular misconceptions about ASS rendering performance, so let's spend some time to explicitly address them here.

To clarify: The statements in the following list are true, but are written to contradict myths that assert the opposite.

  • libass does not completely supersede VSFilter. Any behavior of libass that differs from VSFilter is not guaranteed to stay this way in the future unless libass explicitly promises it.

  • Changing a line's parameters over time with \move, \t, or \fad does not affect performance. Rendering happens one frame at a time, and once the frame timestamp is known a \move is not any different from a \pos.

  • Using drawings does not significantly affect performance (if the drawings are a similar size as a run of characters would be given your current styling). Both drawings and characters are just converted to shapes internally and then rendered in the exact same way.

    If you have very long drawing strings repeated many times across a file, it may be slightly more efficient to create a custom font that has your drawings as glyphs. This is simply because fonts use a binary format and can hence store shapes more compactly than drawing strings. But, in the vast majority of cases, the added cost of parsing a drawing is negligible compared to the cost of rasterizing, blurring, and so on.

  • The complexity of a drawing (that is, the number of vertices) does not have a big effect on its performance cost, at least not until it reaches absurd levels.

  • Having separate events for every single frame does not significantly affect performance.

  • As a result of the previous two points, a subtitle file's file size does not directly affect performance. Yes, an absolutely massive file will likely perform worse than a smaller one on average, but only because a larger file will contain more events, which means more bitmaps, which means higher total bitmap size.

    In particular, splitting an event with a few seconds of duration into a sequence of frame-by-frame events can greatly increase the file size, but will have almost no effect on performance. You may have other reasons to worry about subtitle file sizes, but it does not help to worry about it for performance reasons.

  • Large \blur is not much more expensive than smaller \blur (but it is more expensive than no \blur at all) unless the blurred bitmap is very small. The added cost mainly comes from needing to pad the bitmap more, not from the larger blur strength itself.

  • Rectangular clips are not expensive at all. In fact, they are almost free and can even improve performance in some cases.

  • Simplifying a drawing by removing invisible contours (as ASSWipe does) may not actually improve performance when that drawing is used many times.

  • Hiding a bitmap using \alpha&HFF& will not improve performance, at least at the time of writing. This may change in the future, so it is better to use \alpha&HFF& than not using it, but at the moment you should not rely on it improving performance. To really remove bitmaps, delete or comment the line and/or use \bord0 and/or \shad0 to remove outlines and/or shadows.

    In particular, the shad trick is significantly less efficient than using a single fill bitmap.

Summary of Guidelines

Finally, let us summarize what we have learned in this article with more concrete advice. Some of the advice here is simplified for the sake of brevity; read the above sections for the finer details.

  • You may target exclusively libass for performance, but you should test your subtitle's rendering on both libass and VSFilter and make sure they agree.
  • Make sure that your subtitles also render correctly (and, ideally, perform well) when the display resolution is not a multiple of your PlayRes.
  • The single biggest factor when it comes to rendering performance is total bitmap size. If you remember one thing from this article, it should be this.
  • Avoid large bitmaps that are mostly empty, like long diagonal text or sparse drawings. Split up your lines into multiple runs if you can.
  • Avoid blurring very large bitmaps if you can. Avoid blurring very small bitmaps by extreme amounts (or at least be aware that this can greatly increase their bitmap size).
  • Avoid using fractional \shad on very large bitmaps if you can - use either an integer value like \shad1 or a value like \shad0.001 that is small enough to round to 0 in units of 1/64th of an output pixel.
  • Do not use \be.
  • Avoid using large vector clips. Consider baking vector clips into drawings when feasible.
  • Avoid using diagonal or radial clip gradients when possible. Using a small number of strongly blurred shapes for gradients may be more efficient.
  • In situations that strongly rely on caching, in particular clip gradients, take care not to break the caching by changing parameters other than colors (so in particular also position) between copies of the line.
  • In particular, take great care when using tools like ASSWipe to simplify drawings; they might break caching.
  • Do not use the shad trick unless you are sure you need it.
  • Do not rely on shapes that are outside of the frame borders or outside of your line's rectangular clip being removed and not having a performance cost. If you have a very wide line horizontally scrolling across the frame, break it up into multiple sections and delete the invisible sections at each point in time.
  • Prefer \bord0, \shad0, and commenting or deleting lines or sections to using \alpha&HFF& for hiding lines or sections when possible, unless you believe it interferes with caching.
  • Do not rely on caching across frames. Ensure that each individual frame renders fast enough, even if it is the first frame libass ever renders.
  • If you have a very long drawing that is repeated many times, consider encoding it as a font. This should not be the first tool you reach for when improving performance, though.
  • When in doubt about which method has better performance in a specific case, run benchmarks.

Footnotes

  1. It can still use them for looking up fonts on Windows, but that is not too relevant here.

  2. At this point it should be mentioned that libass maintains a list of libass's ASS extensions on its GitHub wiki. If you know what you are doing and are aware of the risks (or if you cannot avoid it, as could be the case when working with bidirectional text), you can rely on these extensions. However, you should be aware that these extensions are not precisely specified (more on this below) and could hence also have subtle changes in behavior in the future.

  3. Of course, the way I have described this behavior it is simply an implementation detail of libass that a priori has nothing to do with the ASS format. In reality it is the other way around: VSFilter also constructs monochromatic bitmaps out of the subtitles internally, which it then blends onto the video. This causes ASS rendering to behave in the way it does, which in turn allows libass to choose this output format. I am focusing on libass here since it is the main subject of this article and because it is a bit simpler to describe, but either way, understanding that any ASS shape will only ever be able to result in a small number of monochromatic bitmaps with alpha is a very helpful way to internalize the format's abilities and limitations.

  4. Actually, that is not quite true. Any one of these bitmaps can turn into four smaller bitmaps when a rectangular \iclip is involved. But these four bitmaps will then all have the same color, so this does not change anything about the conclusions we will draw from this simplified statement.

  5. Or as a simple rectangle in the case of BorderStyle=3 or 4.

  6. I am using "blur radius" as an informal term here, it is not necessarily the same as the number specified in the \blur tag. (In particular a Gaussian blur has always infinite "radius" in theory.) What I really just mean here is the blur being strong enough that it visibly reaches inside the shape's fill.

  7. The shad trick also has a second benefit: The shadow's position can be changed freely throughout a line's duration using \t(\xshad) and \t(\yshad), while a line's position can only be changed using \move, which only allows for a single linear movement in a certain time interval. Hence, the shad trick can be used to create a moving line in a more compact way (i.e. resulting in a smaller subtitle file). This will usually not help improve performance (in fact it may make it worse due to the cost of the shad trick), but it can help in cases where creating frame-by-frame events would make your file size blow up by extreme amounts, like when frame-by-frame tracking a very complex drawing. Though then again, maybe you should turn such a drawing into a font instead.

  8. Of course, if you have a thousand transforms in a single line they might also start having an impact on performance. But you should simply never need this in practice. Even when transforming five different tags frame-by-frame, you'll probably still only have a couple dozen transforms, which will be perfectly fine. For the rest of this article, please disregard pathological cases (i.e. cases that are simply redundant on a syntax level) like these when I use phrases like "no matter how complex."

  9. This is not a complete list and may not be in the exact correct order in all cases, it just lists the steps that are relevant for understanding rendering and rendering performance in most common cases.

  10. It also entails doing font shaping beforehand, which involves laying out bidirectional text and applying ligatures and diacritics, but this is not too relevant for our purposes. The main takeaway is that there may not be a one-to-one correspondence between "characters" (more specifically, unicode codepoints) in the event text and rendered glyphs.

  11. Again, this is actually not completely true. Libass does have some optimizations for subtitles that do not change from frame to frame. We will talk about caching in detail later on, but libass is also able to communicate to the player that "the subtitles for this frame are exactly the same as for the last frame you rendered", or that "the subtitles for this frame have the same bitmaps and colors as the last frame you rendered; only their positions have changed." However, this optimization is not too relevant: Apart from the fact that the player still needs to blend the bitmaps either way, this only works if the subtitles stay the exact same across frames. As soon as a single event changes, this optimization no longer triggers. Moreover, I would advise against relying on behavior across frames in the first place. It does not help if your second frame renders very quickly if the first one already causes lags, and the viewer could seek around the video and break your assumptions about which frames are rendered after which other frames. It is much better to just optimize every frame's rendering independently.

  12. libass uses some extremely cool mathematics to achieve this. If you are interested in how this works, check out the paper the author wrote about this algorithm.

  13. \blur100 is the largest possible blur value in libass, at the time of writing. VSFilter allows larger blur values, so you should not rely on this capping.

  14. And transparency will always be involved to some degree since shapes will be anti-aliased when rasterized. Even if you only have strictly horizontal or vertical edges at integer PlayRes coordinates, those coordinates can become fractional when the subtitles are rendered at a different resolution.

  15. Assuming a standard style with white fill and a black outline.

  16. I do not recommend using vector clips here, even if they only contain horizontal and vertical lines aligned to PlayRes pixel grid. The performance impact aside, these will not look correct when the subtitles are played back on any other resolution than a multiple of the PlayRes.
