Subtitle Tools in C Language
I decided to release under the GNU GPL some of the C language tools I've written over the years for working with subtitle files. These include SubRip (.srt) tools, a SubStationAlpha (.ssa)-to-SubRip (.srt) converter, a VobSub (.idx/.sub pair of files) tool, and a PGS (.sup file) tool. The VobSub and PGS tools can each extract subtitle images as bitmaps as well as synchronize the timestamps to new anchor points. Some tools for detecting character encoding and converting to UTF-8 are also included.
In addition, I have developed routines for YCbCr-to-RGB and RGB-to-YCbCr conversion, and for deriving the BT.709 RGB colorspace constants from the BT.709 color primaries.
Before using a SubRip file as input to any of the tools listed below, it's generally a good idea to run check.c on it to ensure it's in the proper format.
Table 1: | SubRip (.srt) Tools |
check.c | Check for errors in a SubRip (.srt) file and report results. Compile: gcc -Wall check.c -o check Usage: ./check inputfilename.srt If it finds an error, fix it and re-run check until no errors are found. Warnings will be reported if subtitle numbers (not referring to timestamps here) are not in proper ascending order. These can be ignored if you wish, as most media players ignore the value as long as an integer number is used. You can use offset.c with 0 offset values as input and it will renumber the subtitles correctly. If the timestamps and subtitles are correct but the starting timestamps are not in ascending order, you can use reorder.c to sort the subtitles by starting timestamp so that they are chronological. Output: reports to stdout |
offset.c | Read an existing SubRip (.srt) file, apply positive, negative, or no offset to the timestamps, and save in an output file. Subtitle durations are preserved. Ignores existing subtitle numbers and renumbers them from 1 to N. I often use it just to renumber (for example, because I inserted or removed some subtitles) by entering 0 for the offsets. If present, the Byte Order Mark (BOM) of the input file will be included in the output file. Compile: gcc -Wall offset.c -o offset Usage: ./offset inputfilename.srt Output: out.srt |
sync.c | Read an existing SubRip (.srt) file and synchronize all timestamps to user-input anchor-points. Subtitle durations are preserved. If present, the Byte Order Mark (BOM) of the input file will be included in the output file. Compile: gcc -Wall sync.c -o sync Usage: ./sync inputfilename.srt Output: out.srt Synchronization is accomplished by using "first" and "last" timestamps as anchor-points. Choose "first" and "last" subtitles that are near or at the beginning and end of the feature in order to maximize scaling accuracy. For example, if the existing timestamps for subtitles appearing early and late in the feature are: "first": 00:00:29,280 --> 00:00:31,880 "last": 01:31:29,280 --> 01:31:30,920 and new start times for these subtitles are to be: "first": 00:00:22,280 "last": 01:31:25,000 then you would use sync.c like this:
|
sub.c | A tool to analyze an .idx/.sub pair of VobSub files and produce a report file. Optional functions include producing a bitmap file for each subtitle, applying an offset to timestamps, and synchronizing all timestamps to new anchor-points. Subtitle durations are preserved when applying offsets or synchronizing to new anchor points. The idx/sub input files could be derived from an NTSC or PAL DVD, or HD (1080p) or UHD (4K) BluRay. Note that some idx/sub files have multiple languages of subtitles, and all subtitles from all languages will be extracted to bitmap when bitmap files are requested. For example, an idx/sub file with 25 languages and ~800 subtitles per language will produce a lot of bitmap files. The report file sub.out is typically of similar size to the .sub file. Compile: gcc -Wall sub.c -o sub Usage: ./sub filename.idx filename.sub [option]
Output: sub.out, and optionally: bitmap file for each subtitle, or offset/resynchronized file out.sub and accompanying revised file out.idx. Bitmap filenames include: start and end times (hh_mm_ss_ms__hh_mm_ss_ms), the language ID and language index, and then .bmp. The language index is included because you can have multiple subtitle streams for the same language and they need to be differentiated. SubTitleEdit can read the timestamps from the bitmap filenames if they don't include the language ID and index. Modify the write_bmp() function to suit your needs. Note that SubTitleEdit cannot read the timestamps from similarly named .txt files; for that see txtfiles2srt.c below. |
txtfiles2srt.c | Takes the timestamps from the filenames in a collection of individual subtitle text files, each containing the text for a subtitle, and produces a single SubRip (.srt) file. No Byte Order Mark (BOM) is prepended. Each subtitle text file should contain only lines of text without blank lines or trailing line-feeds. Compile: gcc -Wall txtfiles2srt.c -o txtfiles2srt Usage: ./txtfiles2srt filelistfilename Output: out.srt filelistfilename is a text file containing only a list of the subtitle text files. Each filename is expected to be: hh_mm_ss_ms__hh_mm_ss_ms.txt. For example: 00_13_11_959__00_13_15_213.txt |
ssa2srt.c | Read an existing SubStationAlpha (SSA) file and convert to SubRip (.srt) output file. Transfers styles and markups for font color, bold, italic, underline, strikeout, and alignment. Only recognizes V4 and V4+ Styles. Ignores those SSA style attributes and override tags not implemented in SubRip format. Warning: Unlike SubRip files, SSA files don't require subtitles to be in chronological order. The SubRip output file is not corrected for this rare situation; use reorder.c below. Compile: gcc -Wall ssa2srt.c -o ssa2srt Usage: ./ssa2srt inputfilename Output: out.srt |
ssa2srt-nostyles.c | Read an existing SubStationAlpha (SSA) file and convert to SubRip (.srt) output file. Doesn't transfer styles, only markups for font color, bold, italic, underline, strikeout, and alignment. Warning: Unlike SubRip files, SSA files don't require subtitles to be in chronological order. The SubRip output file is not corrected for this rare situation; use reorder.c below. Compile: gcc -Wall ssa2srt-nostyles.c -o ssa2srt-nostyles Usage: ./ssa2srt-nostyles inputfilename Output: out.srt |
reorder.c | Re-order non-chronological subtitles in a SubRip (.srt) file by sorting on start times. If present, the Byte Order Mark (BOM) of the input file will be included in the output file. Compile: gcc -Wall reorder.c -o reorder Usage: ./reorder inputfilename.srt Output: out.srt |
srt2txt.c | Read an existing SubRip (.srt) file and save only the text lines to an output file. This is useful if you want to submit the text to a translation tool/service without the subtitle numbers and timestamps. Once translated, you can use txt2srt.c to convert back to a SubRip file. If present, the Byte Order Mark (BOM) of the input file will be included in the output file. Compile: gcc -Wall srt2txt.c -o srt2txt Usage: ./srt2txt inputfilename.srt [nospace] Output: out.txt Subtitle texts will be separated by blank lines unless the nospace option is specified. |
txt2srt.c | Take the timestamps from a SubRip (.srt) file and the text from a text file and create a new .srt file. The text file must have the same number of subtitles as the SubRip file, and they must be separated by single blank lines. If present in the text file, the Byte Order Mark (BOM) will be included in the output file. Compile: gcc -Wall txt2srt.c -o txt2srt Usage: ./txt2srt inputfilename.srt inputfilename.txt Output: out.srt |
tag.c | Take formatting tags from one SubRip (.srt) file and text from another SubRip file and create a new SubRip file. This is typically used if you have removed all formatting from a .srt file using striptag.c and converted it to a text file using srt2txt.c in order to translate it. You can convert the translated text file back to .srt using txt2srt.c and then add back the formatting with tag.c. If a Byte Order Mark (BOM) exists in the SubRip file containing the desired text, it will be included in the output file. The two .srt files must have the same number of subtitles and the same number of lines per subtitle. Since the text will likely differ between the two .srt files, tag.c only works for opening tags that appear at the beginning of a line and closing tags that appear at the end of a line. Compile: gcc -Wall tag.c -o tag Usage: ./tag taginputfilename.srt textinputfilename.srt Output: out.srt The following are some examples of formatting tag position arrangements in taginputfilename.srt that tag.c can process: 2 00:00:44,819 --> 00:00:46,819 <font color="#ffffff">* Unheimliches Knurren * </font> 3 00:00:47,719 --> 00:00:51,219 <font color="#ffffff">Ich geh vielleicht als Frankenstein auf Falkenstein. </font> 4 00:00:51,319 --> 00:00:55,819 Boah. Also ich hab noch keinen Plan, als was ich mich verkleide. 5 00:00:55,819 --> 00:00:58,519 <font color="#00ffff"><i>Aber Bibi, </i></font> <font color="#00ffff">die hat richtig gute Ideen, </font> |
fixtag.c | Read an existing SubRip (.srt) file and look for and fix some common markup tag errors. Tags included: italics, bold, underline, strikethrough, font color, font size, and position. Using the optional close argument will cause fixtag to append missing closing tags to the last line of text of the subtitle. You should always compare out.srt with the original subtitle file to determine the author's intent. If present, the Byte Order Mark (BOM) of the input file will be included in the output file. Compile: gcc -Wall fixtag.c -o fixtag Usage: ./fixtag inputfilename.srt [close] Output: out.srt |
striptag.c | Read an existing SubRip (.srt) file and remove markup tags. Tags included: italics, bold, underline, strikethrough, font color, font size, and position. If present, the Byte Order Mark (BOM) of the input file will be included in the output file. Compile: gcc -Wall striptag.c -o striptag Usage: ./striptag inputfilename.srt Output: out.srt |
time-text.c | Take the timestamps from one SubRip (.srt) file and the subtitle texts from another SubRip file and create a new SubRip file with those timestamps and subtitle texts. The two input SubRip files must have the same number of subtitles. If present, the Byte Order Mark (BOM) of the text SubRip file will be included in the output file. Compile: gcc -Wall time-text.c -o time-text Usage: ./time-text timeinputfile.srt textinputfile.srt Output: out.srt |
combine.c | Read an existing SubRip (.srt) file and combine subtitles with identical textual content and immediately-adjacent, consecutive timestamps. Here, "immediately-adjacent, consecutive timestamps" means there is no time period between the two identical subtitles in which no subtitle is displayed. I've only encountered this unusual situation once; I assume the subtitles were machine-generated. Within each group of matching subtitles, combine.c takes the starting timestamp from the first subtitle and the ending timestamp from the last and writes a new SubRip file. If present, the Byte Order Mark (BOM) of the input file will be included in the output file. Compile: gcc -Wall combine.c -o combine Usage: ./combine inputfilename.srt Output: out.srt |
split.c | I only created this to produce test files for combine.c. Read an existing SubRip (.srt) file and split each subtitle into two identical subs with consecutive timestamps. If present, the Byte Order Mark (BOM) of the input file will be included in the output file. Compile: gcc -Wall split.c -o split Usage: ./split inputfilename.srt Output: out.srt |
readbom.c | Read an existing SubRip (.srt) or text file and determine if there is an existing Byte Order Mark (BOM) and report results to stdout. This does not detect character-encoding by analyzing the text; see ced and enc below instead. Compile: gcc -Wall readbom.c -o readbom Usage: ./readbom inputfilename Output: reports to stdout |
writebom.c | Read an existing SubRip (.srt) or text file and if an existing Byte Order Mark (BOM) does not already exist, prepend the BOM selected by the user. This does not change character-encoding of the text; see enc below instead. Compile: gcc -Wall writebom.c -o writebom Usage: ./writebom inputfilename Output: out.txt |
stripbom.c | Strip the Byte Order Mark (BOM), if it exists, from a SubRip (.srt) or text file. Compile: gcc -Wall stripbom.c -o stripbom Usage: ./stripbom inputfilename Output: out.txt |
ced.tar.gz | This is Google's 2016 character-encoding detector ced, adapted by me to be used as a command line tool to analyze SubRip (.srt) or text files (rather than web pages within a web browser as originally intended). Compile: gunzip, untar, and then use make. Result is ced. Their code produces numerous compilation warnings which you can safely ignore. Usage: ./ced inputfilename Output: reports to stdout |
enc.tar.gz | Detect the likely character-encoding of a SubRip (.srt) or text file using ced and also Linux's chardet, ask the user for their best guess at the character-encoding based on the results, convert to UTF-8 using Linux's iconv, and prepend a UTF-8 Byte Order Mark (BOM) if requested. The ced code is built into enc here, and chardet and iconv are executed via system calls. I use Ubuntu which includes chardet and iconv but I don't know about other Linux flavors. Compile: gunzip, untar, and then use make. Result is enc. Google's ced code produces numerous compilation warnings which you can safely ignore. Usage: ./enc inputfilename When prompted for choice of encoding, enter the name of the encoding, for example: KOI8-R or ISO_8859-2. I believe the encoding name is generally not case-sensitive. Output: out.txt |
samples.tar.gz | This is a collection of character-encoding sample files I created. The source of the original English UTF-8 text was https://en.wikipedia.org/wiki/Mary_Read. Starting from that UTF-8.en.sample, I replaced non-standard quotes with ", used DeepL and Google Translate to translate the text into other languages, and then used iconv to convert to various other encodings. These were only semi-useful, as sometimes chardet detects correctly, sometimes ced, sometimes neither. For some reason I haven't figured out, I found chardetng to be the least reliable, so I don't use it. Some of these encodings are identical but have different names, since some encodings are subsets of others. |
time-diff.c | Given two timestamps from standard input, calculate the time difference and output as a timestamp. Compile: gcc -Wall time-diff.c -o time-diff Usage: ./time-diff Output: reports to stdout |
time-add.c | Given two timestamps from standard input, calculate the sum and output as a timestamp. Compile: gcc -Wall time-add.c -o time-add Usage: ./time-add Output: reports to stdout |
Table 2: | PGS (.sup) Tool |
pgs.c | A tool to analyze a PGS (.sup) file and produce a report file. Optional functions include producing a bitmap file for each subtitle, applying an offset to timestamps, and synchronizing all timestamps to new anchor-points. Subtitle durations are preserved when applying offsets or synchronizing to new anchor points. The .sup input file could be derived from an HD (1080p) or UHD (4K) BluRay. Compile: gcc -Wall pgs.c -lm -o pgs Usage: ./pgs filename.sup [option]
Output: pgs.out, and optionally: bitmap file for each subtitle, or offset/resynchronized PGS file out.sup |
Table 3: | Chapter Tool |
chapters.c | Create an XML chapters file given the feature duration and desired number of chapters. The resulting .xml file can be added to a video container using tools like MKVToolNix GUI. There are often 12 to 16 chapters for a typical 1.5-hour feature. I usually use ffmpeg to find the duration, for example: ffmpeg -i filename.mkv Note that chapters.c expects the same time notation as ffmpeg, i.e., with fractions of a second. (This is different from the SubRip/srt file format, which uses ",milliseconds".) You can copy and paste the result from ffmpeg. Compile: gcc -Wall chapters.c -o chapters Usage: ./chapters Output: chapters.xml |
Table 4: | Colorspace Tools |
References: ITU-R BT.709-6, ITU-T H.273 (V4), and SMPTE RP 177-1993 |
bt709.c | Derive the RGB color constants for BT.709 YCbCr colorspace. The Normalized Primary Matrix (NPM), which includes KR, KG, and KB, is derived from the color primaries as defined in the BT.709 standard. Compile: gcc -Wall bt709.c -lm -o bt709 Usage: ./bt709 Output: reports to stdout |
ycbcr2rgb.c | Convert YCbCr (BT.709) to 8-bit sRGB. Assumes BT.709 color primaries and gamma-correction were used to produce YCbCr. Uses sRGB color primaries (same as BT.709) and applies sRGB gamma-correction to produce sRGB coordinates. Compile: gcc -Wall ycbcr2rgb.c -lm -o ycbcr2rgb Usage: ./ycbcr2rgb Output: reports to stdout |
rgb2ycbcr.c | Convert 8-bit sRGB to YCbCr (BT.709). Assumes sRGB gamma-correction was applied to sRGB. Uses BT.709 color primaries and applies BT.709 gamma-correction to produce YCbCr. Compile: gcc -Wall rgb2ycbcr.c -lm -o rgb2ycbcr Usage: ./rgb2ycbcr Output: reports to stdout |
P. David Buchan pdbuchan@gmail.com