SUBSHOP

Tools to download, remove ads, and synchronize subtitles.

Purpose

subshop, or “Subtitle Workshop”, is a set of subtitle tools intended to mostly automate:

  • the downloading of subtitles (if needed) for video files,
  • synchronizing external subtitles with the audio track, and
  • removing ads from the subtitles.

When necessary or preferred, the tools can be used manually for in-a-hurry situations and/or to improve or correct the automated decisions.

There are some novel features that enhance the user experience including:

  • explicitly and intuitively scoring the subtitle synchronization so that the user (and tasks) know which subtitles need remediation,
  • caching “precious” information, like reference subtitles, so that multiple subtitle sync trials can be done quickly,
  • implementing both linear and segmented linear adjustments to subtitles for best chance at synchronization.

Limitations

Current limitations of these tools are:

  • only English subtitles and audio tracks are supported,
  • only srt subtitles are supported,
  • for best / most automated operation, movies and TV episodes must be organized in a PLEX-like directory structure,
  • auxiliary information is stored (mostly) in a “cache” directory, one per video file,
  • must run on a modern Linux (or sufficient Linux comparable) operating system.
  • Python 3.6 is the bare minimum, but Python 3.8+ is best.
  • Python 2.x is required if you choose to use autosub.

Required Web Credentials

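You are expected to obtain:

  • a TMDb API key (the tmdb-apikey configuration value),
  • an OpenSubtitles.org user name and password (the opensubtitles-org-usr-pwd configuration value), and
  • optionally, a Plex URL and token (the plex-url-token configuration value).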

OpenSubtitles.org is fairly unreliable (e.g., you might find it down 10% of the day). subshop attempts to use these unreliable/stingy resources efficiently, and automating tasks in the background can reduce frustration. The cache files are an important part of that strategy.

Plex is used narrowly; if you have a large collection and/or very limited CPU/RAM resources, you can configure subshop to use Plex for its searches which, for some installs, reduces searches for videos from, say, 10 minutes to nearly instantaneous.

Installation, Configuration, and Preparation

Installation Procedure

Clone the project into, say, your home directory, and install it into, say, your user directories:

    $ cd; git clone https://github.com/joedefen/subshop.git
    $ cd subshop; pip3 install . --user

Having installed the code and its python dependencies, now install the non-python dependencies (e.g., ffmpeg, ffprobe, and the VOSK Model).

    $ ./setup-sys-deps

NOTE: this script will not work for every Linux variant, and you may need to install manually whatever it cannot.

As a quick test, run subshop dirs; this shows the folders that subshop uses to store persistent data and it creates the default configuration file (which always requires adjustment).

Configuration

The configuration is stored in subshop.yaml, and, by default, in the ~/.cache/subshop/ folder.

You’ll need to edit subshop.yaml and:

  • at least, change the YOUR-{something} values; use the comments to enlighten you on what is needed.
  • review other values and ensure the defaults are desirable for you.

The essential configuration to update is:

  - /YOUR-TV-ROOTDIRS
  - /YOUR-MOVIE-ROOTDIRS
  - tmdb-apikey: YOUR-TMDB-APIKEY # from tmdb-api.com
  - opensubtitles-org-usr-pwd: YOUR-USER YOUR-PASSWD # from opensubtitles.org (200down/day free)

Note:

  • You must list your TV and Movie directory trees for:
    • the built-in search
    • for giving type hints to the video filename parser.
  • The Movie Database API is used to ascertain and save IMDB IDs for accurate downloads.
  • The OpenSubtitles.org user/password is required for downloading subtitles.

And, you may wish to configure PLEX sooner rather than later (to be sure, PLEX is NOT required); otherwise, disable it.

  - plex-url-token: YOUR-PLEX-URL YOUR-PLEX-TOKEN # if empty string, plex is not enabled
  - search-using-plex: false # search for videos w plex (if configured)?
  - plex-path-adj: "" # set -/{prefix} and/or +/{prefix} to make local path

Specifically, for PLEX:

  • The plex-url-token is needed if you wish to search for video files using Plex vs the built-in search; the bigger and slower your disk, the more likely Plex will be faster (and sometimes night-and-day faster). Also, the Plex searches are “smarter”. In addition:
    • Set search-using-plex to true if you wish that default.
    • Set plex-path-adj appropriately if Plex’s view of the file system disagrees with the local view (and you are using Plex for searches).

Tip:

  • after making/writing changes to subshop.yaml, stay in the editor and run subshop run ConfigSubshop in a subshell.
  • make a few changes, write, check, and repair; the more changes you make w/o checking, the harder to isolate the problem.

The config will not load if:

  • there are YAML syntax errors, or
  • the expected basic type of a parameter is incorrect (e.g., you change a string parameter to a numeric type).
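The "basic type" rule can be pictured with a minimal sketch that compares edited values against the defaults. This is only an illustration and not subshop's actual loader; the function name `type_mismatches` is hypothetical, and the two keys are taken from the configuration shown above.

```python
def type_mismatches(defaults, loaded, prefix=""):
    """Return messages for values whose basic type differs from the default's type."""
    problems = []
    for key, default in defaults.items():
        if key not in loaded:
            continue  # assumption: a missing key simply keeps its default
        value = loaded[key]
        path = f"{prefix}{key}"
        if isinstance(default, dict) and isinstance(value, dict):
            # Recurse into nested sections.
            problems.extend(type_mismatches(default, value, path + "."))
        elif type(value) is not type(default):
            problems.append(
                f"{path}: expected {type(default).__name__}, got {type(value).__name__}")
    return problems


defaults = {"tmdb-apikey": "YOUR-TMDB-APIKEY", "search-using-plex": False}
edited = {"tmdb-apikey": 12345, "search-using-plex": True}  # string changed to a number
print(type_mismatches(defaults, edited))
```

Checking after every small batch of edits, as the tip above suggests, keeps such a list short enough to act on.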

Expected Video Folder Organization

TV series and movies should be organized similar to this:

/TV Series Root1/  # there can be many tv root folders
    Alpha Show/
        omdbinfo.yaml # subshop cached IMDB info (from OMDb usually)
        alpha.show.s01e01.anything.mkv # .ext can vary
        alpha.show.s01e01.anything.en.srt # .en.srt can vary
        alpha.show.s01e01.anything.cache/ # subshop cached info folder
        alpha show 1x2 anything.avi # .ext can vary
        alpha show 1x2 anything.en04.srt # .en04.srt can vary
        alpha show 1x2 anything.cache/ # subshop cached info
        ...
    Beta Show/
        omdbinfo.yaml # subshop cached IMDB info (from OMDb usually)
        Season 01/  # episodes under seasons share omdbinfo
            beta.show.s01e01.anything.mkv # .ext can vary
            ...
        ...
    Gamma Show/
        Gamma Show 1/  # episodes under non-season dirs have own omdbinfo
            omdbinfo.yaml # subshop cached IMDB info (from OMDb usually)
            gamma.show.s01e01.anything.mkv # .ext can vary
            ...
        ...
/Movie Root1/  # there can be many movie root folders
    Movies for Mom/  # there can be many movie group folders
        Movie.Alpha.2020.anything.mkv # .ext may vary
        Movie.Alpha.2020.anything.en07.srt # .en07.srt may vary
        Movie.Alpha.2020.anything.cache/ # subshop cache info
            omdbinfo.yaml # omdb is specific to one movie
            ...
        Movie Beta (1999) anything/ # hierarchical vs (above) flat
            Movie Beta (1999) anything.mkv # .ext may vary
            Movie Beta (1999) anything.en07.srt # .en07.srt may vary
            Movie Beta (1999) anything.cache/ # subshop cache info
                omdbinfo.yaml # omdb is specific to one movie
            ...

NOTES:

  • If your video collection is compatible with either PLEX or Emby, you likely have a suitable organization already.
  • You supply the video files and existing external subtitles (optionally, with .en.srt or .srt extensions) and the basic folder hierarchy that:
    • separates TV series and movies at a high level.
    • groups TV series and movies arbitrarily.
    • places TV episodes into season folders or not (consistently per TV show). If you have season folders, then “Season 00/” and “Specials/” folders indicate TV specials.
    • places movies in subfolders named with the video filenames less extensions or not (consistency is NOT required even within one group).
  • subshop adds subtitle files and cached information; all its cached data is in .cache folders except for TV series omdbinfo.yaml files.
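As a small illustration of the layout above, the cache folder sits next to the video and is named after the video file less its extension, plus ".cache". The helper name `cache_dir_for` is hypothetical; this is a sketch of the naming convention, not subshop's own code.

```python
from pathlib import Path


def cache_dir_for(video):
    """Return the sibling '<video name less extension>.cache' folder path."""
    video = Path(video)
    # Path.stem drops only the final extension, so dotted names stay intact.
    return video.with_name(video.stem + ".cache")


print(cache_dir_for("/TV Series Root1/Alpha Show/alpha.show.s01e01.anything.mkv"))
print(cache_dir_for("/Movie Root1/Movies for Mom/Movie Beta (1999) anything.mkv"))
```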

Video File Naming Conventions

Video filenames should be “parsable” by SubShop (see subshop parse subcommand) meaning:

  • for TV episodes, subshop can parse the show name, season number, and episode number, and
  • for movies, subshop can parse the title and year.
    • if the year is not present, subshop may work well.
    • if the year is wrong, subshop likely will not work well.

subshop uses its own parser, VideoParser.py, to parse the names. Within the script, you can verify what is likely to work and what is not by looking for:

  • regexes: examine the regular expressions and comments
  • tests_yaml: look at the tests (mostly hard cases) and the parsing results; notice that a few cases near the end are “failures”.

Anyhow, we advise running subshop parse on your entire collection, and, if you wish subshop to work well, fixing its complaints (although movies w/o the year are more optional than unparsable TV episode numbers).
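To give a feel for what "parsable" means, here is a rough sketch that handles only the simplest naming forms from the folder listing above (s01e01, 1x2, and a four-digit year). The regular expressions and the names `EPISODE`, `MOVIE`, and `parse_name` are hypothetical; VideoParser.py has its own regexes and covers many more cases.

```python
import re

# Hypothetical patterns for illustration only.
EPISODE = re.compile(
    r"^(?P<show>.+?)[ ._-]+"
    r"(?:s(?P<s1>\d{1,2})e(?P<e1>\d{1,3})|(?P<s2>\d{1,2})x(?P<e2>\d{1,3}))\b",
    re.IGNORECASE)
MOVIE = re.compile(r"^(?P<title>.+?)[ ._(-]+(?P<year>(?:19|20)\d{2})\b")


def clean(text):
    """Turn dot/underscore separators into spaces."""
    return re.sub(r"[._]+", " ", text).strip()


def parse_name(filename):
    """Return ('tv', show, season, episode), ('movie', title, year), or None."""
    m = EPISODE.match(filename)
    if m:
        season = int(m.group("s1") or m.group("s2"))
        episode = int(m.group("e1") or m.group("e2"))
        return ("tv", clean(m.group("show")), season, episode)
    m = MOVIE.match(filename)
    if m:
        return ("movie", clean(m.group("title")), int(m.group("year")))
    return None  # unparsable in this simplified sketch


for name in ("alpha.show.s01e01.anything.mkv", "alpha show 1x2 anything.avi",
             "Movie.Alpha.2020.anything.mkv", "Movie Beta (1999) anything.mkv"):
    print(name, "->", parse_name(name))
```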

Description/Rationale for the Cached Files

subshop creates a number of cached files; specifically:

  • omdbinfo.yaml: caches/stores the IMDB ID and other info gathered from TMDb (The Movie Database). Caching this information makes it “sticky” so that retrying a subtitle search for a better fit is more reliable.
  • probeinfo.yaml: caches selected info from ffprobe to avoid the second or so per video file to determine if it has embedded subtitles, has an English audio stream, etc.
  • *.REFERENCE.srt or *.AUTOSUB.srt: caches the (very expensive) audio-to-text conversion needed to sync / score the fit of subtitles; having this makes finding better subtitles, etc., much, much faster.
  • *.EMBEDDED.srt: subshop can extract and sync embedded subtitles when you wish to do so because they are misfits.
  • *.TORRRENT.srt: stores any “original” subtitle (via torrent or not). If you replace the original, you can return to it or reprocess it for any reason.
  • *.srt: other downloaded subtitles are kept for possible reprocessing but also to know what has been tried so that re-downloading subtitles for a better fit can avoid duplicate downloads.
  • quirk.*: per cache, subshop stores at most one “quirk” file for faster screening. The quirk types from highest priority to least are:
    • quirk.FOREIGN: has no English audio track (so automatically ignored).
    • quirk.IGNORE: manually ignored (because you don’t care or you wish to stop trying to find/sync subtitles for “lost causes”).
    • quirk.SCORE.{NM}: the two-digit “score” of the defaulted subtitle (usually named *.en.srt); scores are used to automatically select the best subtitle fit.
    • quirk.AUTODEFER: the automatic download failed or sync produced poor results. The age of this file and the age of the video file determine when automatic retries are done.
    • quirk.INTERNAL: has embedded subtitles; this file acts as a “soft” automatic ignore; you can extract/sync the embedded subtitles or “force” the download of subtitles to override the automatic reluctance.

SUBSHOP Command Use

Common terms and options

subshop’s Options, Sub-Commands, and Targets

The subshop sub-commands often share terminology and conventions. The typical form of a subcommand is:

    $ subshop -h # shows the available subcommands
    $ subshop {subcmd} -h # help for the given subcommand
    $ subshop {subcmd} [{options}] {targets} # typical use

Selecting subshop “Options”

Options are specified with -{letter} or --{word} arguments. In Python 3.7+, options and non-options can be intermixed; otherwise, you must place all sub-command options immediately after the {subcmd}. Here are a few common options:

  • -h/--help: shows basic usage
  • -n/--dry-run: shows what a sub-command would do, more or less.
  • -v/--verbose: add more detail to output
  • -V/--log-level {level}: where the {level} may be standard logging levels (e.g. “INFO”) or any of the custom levels shown by --help.
  • -o/--only {tv|movie}: select only TV episodes or movies.
  • -O/--one: select just the shortest title match, preferring TV episodes if both a TV show and movie match.
  • -p/--plex: if Plex is configured, use Plex for searches.
  • -P/--avoid-plex: even if Plex is configured, use the built-in search.

See the description of every sub-command in the “Sub-Commands” section below.

Selecting subshop “Targets”

Most of the sub-commands operate on video file “targets”. The {targets} may be one of:

  • a list of folders and/or video files, or

  • a TV show, season, or episode, specifying the title and optional season and episode; e.g.,

    • blue bloods # all seasons/episodes
    • blue bloods 1x # or s01 to specify season 1
    • blue bloods 1x3 # or s01e03 to specify a specific episode
  • a movie title; e.g. wonder woman

For a non-PLEX search, a given title matches the video file only if:

  1. the specified title is an exact substring in the parsed video file title (ignoring case), and
  2. (for TV episodes only) if supplied, the given season/episode matches in spirit (e.g., “2x3” matches “s02e03”)

In a non-PLEX title search, the matched videos with the shortest video name are selected by default, so you may need to lengthen the name for a precise match. Also, you can opt for:

  • -e/--every to have every title match be selected

If the title search is inadequate, using the pathname of the video (or folder of videos) always works.
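The non-PLEX matching rules above can be sketched roughly as below. This is a simplified illustration, not subshop's search code: the function names are hypothetical, "shortest video name" is approximated by the length of the parsed title, and the title "Blue Bloods Reunion" is invented for the example.

```python
def matches(query_title, parsed_title, want=None, have=None):
    """want/have are (season, episode) tuples; want may be None, and its episode may be None."""
    if query_title.lower() not in parsed_title.lower():
        return False  # rule 1: case-insensitive substring of the parsed title
    if want is None:
        return True
    season, episode = want
    if have is None or season != have[0]:
        return False  # rule 2: the season must agree when supplied
    return episode is None or episode == have[1]


def select(query_title, videos, want=None, every=False):
    """videos: list of (parsed_title, (season, episode) or None, path)."""
    hits = [v for v in videos if matches(query_title, v[0], want, v[1])]
    if every or not hits:
        return hits  # -e/--every keeps all title matches
    shortest = min(len(v[0]) for v in hits)
    return [v for v in hits if len(v[0]) == shortest]


videos = [
    ("Blue Bloods", (1, 3), "a.mkv"),
    ("Blue Bloods", (2, 3), "b.mkv"),
    ("Blue Bloods Reunion", (1, 3), "c.mkv"),  # hypothetical longer title
]
print(select("blue bloods", videos, want=(1, None)))              # season 1, shortest title only
print(select("blue bloods", videos, want=(1, None), every=True))  # every season-1 match
```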

You may restrict targets to TV episodes or movies with the option:

  • -o/--only {type} where {type} may be “tv” or “movie”.

Reference Subtitles

Reference subtitles are generated (and cached) by the external tool, autosub, or the internal tool, video2srt.

  • Reference subtitles are generally unusable as “real”, external subtitles because they have too many omissions/errors.
  • But, reference subtitles are generally good enough to correlate with external subtitles to synchronize those with the video;
  • Reference subtitles are also used to judge how well subtitles are synced with the video.

video2srt generally does a more accurate job than autosub, but autosub might be preferred if local CPU and/or memory resources are scarce since most of the (considerable) effort is exported to the cloud.

video2srt uses vosk (available on PyPI) to create reference subtitles.

Subtitle Score

Subtitles are given a score from 1 to 19 that represents:

  • the tenths of seconds of standard deviation of the subtitles relative to the reference, plus
  • a penalty of 0 to 20 if the number of captions correlated to the reference subtitles is under 50.

If the net subtitle score is not between 1 and 19, then it is coerced into that range.

Scored subtitles are renamed with their score; e.g., “foobar.en.srt” is renamed “foobar.en09.srt” to indicate its score is 9 when analyzed. For filtering purposes, an unscored subtitle is given an arbitrary, large score (e.g., 100) but that score is not put into its file name. Beware:

  • many players (e.g, vlc) handle the en09.srt w/o ado, but,
  • mpv does not by default; you must add --sub-auto=fuzzy to its arguments or add sub-auto=fuzzy to your ~/.config/mpv/mpv.conf.
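The scoring and renaming described above can be sketched as below. The text does not say how the 0 to 20 penalty is derived, so it is simply passed in here; rounding to the nearest tenth is also an assumption, and both function names are hypothetical.

```python
def subtitle_score(stdev_secs, penalty=0):
    """Tenths of seconds of standard deviation plus a penalty, coerced into 1..19."""
    raw = round(stdev_secs * 10) + penalty  # assumption: round to the nearest tenth
    return max(1, min(19, raw))


def scored_name(filename, score):
    """Insert the two-digit score after the language tag: foobar.en.srt -> foobar.en09.srt."""
    stem, lang, ext = filename.rsplit(".", 2)
    return f"{stem}.{lang}{score:02d}.{ext}"


score = subtitle_score(0.87)  # 0.87 s deviation, no penalty
print(score, scored_name("foobar.en.srt", score))
```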

Filtering on score. Some sub-commands honor options:

  • -m/--min-score {score}: filter for videos whose subtitles score at least the given floor (i.e., are at least that poor)
  • -M/--max-score {score}: filter for videos whose subtitles score no worse than the given score.
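A minimal sketch of score filtering follows, assuming the score is read back from the two-digit suffix in names like “foobar.en09.srt” and that unscored subtitles count as 100, as described above. The regex and function names are hypothetical.

```python
import re

UNSCORED = 100  # arbitrary large score given to unscored subtitles
SCORE_RE = re.compile(r"\.en(\d{2})\.srt$")


def score_of(filename):
    """Return the score embedded in the filename, or UNSCORED if there is none."""
    m = SCORE_RE.search(filename)
    return int(m.group(1)) if m else UNSCORED


def filter_by_score(filenames, min_score=None, max_score=None):
    """Keep names whose score lies within the optional inclusive bounds."""
    keep = []
    for name in filenames:
        score = score_of(name)
        if min_score is not None and score < min_score:
            continue
        if max_score is not None and score > max_score:
            continue
        keep.append(name)
    return keep


names = ["a.en04.srt", "b.en12.srt", "c.en.srt"]
print(filter_by_score(names, min_score=10))  # the poorer fits, including the unscored one
print(filter_by_score(names, max_score=9))   # only the good fits
```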

Sub-commands

subshop stat {targets} # get basic subtitle status

Shows the summary subtitle information of the targets.

  • -v/--verbose: shows much detail including all cached information
  • -m/--min-score: process only subtitles with at least the given minimum score
  • -M/--max-score: process only subtitles with no more than given maximum score

subshop search {targets} # search for TV shows and/or movies

Shows a “search” result meaning:

  • tvshow folders, and
  • movie video files.

This command requires a search phrase built from the {targets}; for this subcommand {targets} cannot be files/folders and at least one is required.

If Plex is configured, you may wish to compare/time searches with and without Plex (i.e., -p and -P). Also, this command is handy if you wish to know where certain media is stored.

The search results are usually fairly similar unless subshop and Plex do not search the same folders; to use Plex, ensure they agree.

Search times will depend on many factors; but, the slower your disk performance, the more likely that Plex will perform better comparatively.

subshop ref {targets} # generate reference subtitles

Generates reference subtitles for the given targets ONLY if (1) the target has no reference subtitles, (2) the target has external subtitles OR has no internal subtitles, (3) there is an English audio stream, and (4) the video is not in the “IGNORE” state.

  • -n/--dryrun: use to verify how many/which reference subtitles you would generate.
  • --random: randomize the targets
  • -q/--quota: stop processing targets after the given limit
  • --todo: work down the TODO list (see “todo” subcommand) rather than {targets}
  • -d/--days: only process videos newer than a given number of days

Note, generating reference subtitles is usually a byproduct of subshop dos, but that might be limited by quota or outages, and pre-generating the reference subtitles can speed the eventual download-and-sync operation whether done manually or in the background.

subshop dos {targets} # download-and-sync subtitles

Download and sync subtitles for the targets. Requires (1) an English audio stream, (2) not in IGNORE state, (3) no internal or external subs, (4) not a TV special, and (5) either interactive mode or not deferred by automatic maintenance.

  • -i/--interactive: for manual control over selecting the OMDB match and subtitles (although normally, use the non-interactive mode unless expecting/having problems getting the initial subtitles)
  • -n/--dryrun: use to verify how many/which subtitles you would download.
  • --todo: work down the TODO list (see “todo” subcommand) rather than {targets}
  • -q/--quota: stop processing targets after the given limit

The dos sub-command can be run rather indiscriminately because it restricts itself to targets that need subs, can have subs, and subs are desired.

subshop redos {target} # re-download-and-sync subtitles

Re-do the download and sync of subtitles for the targets; this is typically only done to correct automatic subtitle download and sync resulting in no found subtitles or misfit subtitles. Requires (1) an English audio stream, (2) not in IGNORE state, (3) has internal or external subs.

  • -i/--interactive: for manual control over selecting the OMDB match and subtitles (although normally, use the interactive mode to manually repair problems).
  • -m/--min-score: process only subtitles with at least the given minimum score
  • -M/--max-score: process only subtitles with no more than given maximum score
  • --todo: work down the TODO list (see “todo” subcommand) rather than {targets}

Running redos interactively allows you to correct IMDB information and search differently for subtitles in the case automatic search results were poor.

subshop sync {target} # synchronize (yet again) subtitles

Re-do the sync of subtitles for the targets (w/o a download) for whatever reason (e.g., changed sync parameters, replaced reference subtitles, or a wish to reexamine details of the synchronization). The reference subtitles will be regenerated if necessary. Requires (1) an English audio stream, (2) not in IGNORE state, (3) has internal or external subs.

  • -m/--min-score: process only subtitles with at least the given minimum score
  • -M/--max-score: process only subtitles with no more than given maximum score
  • -v/--verbose: shows the correlated reference/non-reference subtitles and timing differences; if there are many non-trivial text matches, then the video and subtitles are very likely to belong to the same movie or episode; if the timing differences have large discontinuities, there are huge rifts in the video relative to the subtitles.

subshop anal {targets} # analyze the quality of subtitles

Re-analyzes the subtitles. If tuning parameters have changed, you might get different results.

  • -m/--min-score: process only subtitles with at least the given minimum score
  • -M/--max-score: process only subtitles with no more than given maximum score

subshop todo {targets} # create TODO lists for maintenance

Creates lists of videos that need subtitles, need reference subtitles, or have poor fits suggesting that retrying is in order. The commands that honor the --todo option will try to work down appropriate lists. The lists are named:

  • vip-dos: newer videos w/o subs but with reference subs.
  • vip-ref-dos: newer videos w/o subs and w/o reference subs.
  • dos: videos w/o subs but with reference subs.
  • ref-dos: videos w/o subs and w/o reference subs.
  • redos: videos with misfit subs ready to retry.
  • defer-dos: videos w/o subs awaiting auto retry.
  • defer-redos: videos with misfit subs awaiting auto retry.

Notes:

  • Normally, running this command overwrites the current set of TODO lists (i.e., there can only be one set).
  • In some installs, this command can take minutes, but the commands working down the TODO lists should start fast.
  • Normally, provide no targets so your entire collection is scanned; if you wish to focus on a subset of your collection, then provide targets.
  • The number of TODO items per list actually stored is limited by configuration (since there is no need for more items than are doable in a day); the stored items are a random sample; when other commands tackle a TODO list, they do so in random order.

Some commonly used options with todo:

  • -v/--verbose: shows all the TODO items, not just the summary.
  • -n/--dry-run: only shows the current state of the TODO list; with -v shows every remaining item.

subshop ignore {targets} # disable subtitle actions

Sets the state to IGNORE for the video; this will inhibit most sub-command actions on the target except for ‘unignore’.

subshop unignore {targets} # re-enable subtitle actions

Clears the IGNORE state for the video; this enables most sub-command actions on the target.

subshop zap {targets} # remove external subtitles

Remove external subtitles for the {targets}. Obviously, use with caution since you can easily remove all your subtitles.

  • -n/--dryrun: use to verify how many/which subtitles you would remove.

subshop -D{secs} delay {targets} # manually shift subtitle times

Delays the subtitles by the amount given in the -D/--delay-secs option. A negative amount makes the subtitles appear earlier rather than later. Requires an “installed” (apparently English) subtitle and acts only on the preferred one (e.g., “video.en.srt” is preferred over “video.srt”).

One use case is to adjust English subtitles for a foreign language video since subshop does not support non-English language audio.

Your media player likely has a mechanism to ascertain the delay manually; e.g.:

  • mpv player: ‘z’ adds 100ms delay; ‘Z’ subtracts 100ms delay; pass the cumulative amount as the -D value.
  • VLC media player: ‘h’ adds 50ms delay and ‘g’ subtracts 50ms delay; pass the cumulative amount as the -D value.
  • PLEX web player: select “Playback Settings / Subtitle Offset” and then click buttons to adjust the offset by +50ms or -50ms; pass the negative of the cumulative offset as the -D value. Note that PLEX on Roku does not support subtitle offset adjustment.

Honored options include:

  • -D/--delay-secs: the amount of time to delay the subtitles; if the absolute value is not under 50, then the time is presumed to be in milliseconds. Setting -D0.0 makes sense if desiring only to rerun the ad detection and removal.
  • -i/--interactive: if ads are detected, you get a chance to allow/deny their removal.

Beware:

  • Avoid specifying multiple targets since the same delay will be applied to every target.
  • Ads will be removed again; if your ad detection parameters are changed, then more ads may be removed.
  • This command overwrites the subtitle file.
  • If you specify a negative delay, subtitles that end up with negative times are removed; this is not reversible with another run using a positive delay.
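The -D interpretation and the loss of early captions can be sketched as follows. This is an illustration under stated assumptions, not subshop's implementation: captions are modeled as (start, end, text) tuples in seconds, a caption is dropped when its start time becomes negative, and both function names are hypothetical.

```python
def delay_in_seconds(value):
    """Interpret a -D value: magnitudes of 50 or more are presumed to be milliseconds."""
    return value / 1000.0 if abs(value) >= 50 else float(value)


def shift_captions(captions, delay_secs):
    """Shift (start_secs, end_secs, text) captions; drop those pushed before time zero."""
    shifted = []
    for start, end, text in captions:
        start, end = start + delay_secs, end + delay_secs
        if start < 0:
            continue  # assumption: captions with a negative start time are removed
        shifted.append((start, end, text))
    return shifted


captions = [(1.0, 2.0, "first"), (5.0, 6.0, "second")]
earlier = shift_captions(captions, delay_in_seconds(-2000))  # -2000 is read as -2.0 seconds
print(earlier)
print(shift_captions(earlier, 2.0))  # shifting back does not restore the dropped caption
```

The second call shows why the operation cannot be undone: the first caption is already gone from the overwritten file.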

subshop grep {targets} # find patterns in subtitles

Used to verify what ads would be removed if run on the current, external subtitles for the targeted videos, and optionally remove the matching subtitles. With -g, you can specify an ad hoc pattern; with -G, you can apply the configured regexes. You can specify both -g and -G, and if you specify neither, then -G is assumed.

Suggested uses:

  • Use -g to search for a possible pattern to configure if it is a good identifier of ads (i.e., very few or no false positives).
  • Use -G to determine what ads would be removed if (presumably) updated, configured regexes were applied.
  • Use -fG to remove ads per the current set of configured regexes.

Honored options include:

  • -g/--grep {regex}: grep for the given {regex}
  • -G/--grep-regexes: grep the configured regexes.
  • -f/--force: update the subtitles by removing the matched captions.
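The grep-then-remove idea can be sketched as below: report captions matching any regex, and keep the rest. The pattern shown is a made-up example rather than one of subshop's configured regexes, case-insensitive matching is an assumption, and `grep_captions` is a hypothetical name.

```python
import re


def grep_captions(captions, patterns):
    """Return (matched, kept); a caption matches if any regex is found in its text."""
    regexes = [re.compile(p, re.IGNORECASE) for p in patterns]  # assumption: ignore case
    matched, kept = [], []
    for text in captions:
        if any(r.search(text) for r in regexes):
            matched.append(text)
        else:
            kept.append(text)
    return matched, kept


captions = ["Hello there.", "Subtitles by example-subs.invalid", "Goodbye."]
matched, kept = grep_captions(captions, [r"subtitles by"])  # hypothetical ad pattern
print("would remove:", matched)
print("would keep:", kept)
```

Reviewing the matched list before removing anything mirrors the suggested workflow of checking with -g or -G before using -f.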

subshop parse {targets} # check parsability of video filenames

Used to check the parsing accuracy of your video files; i.e.,

  • for TV episodes, subshop should be able to parse the show name, the season and the episode number.
  • for movies, subshop should be able to parse the title and year.

It is not necessary that EVERY video file is parsable, but unparsable videos will impair both automated and manual download tasks. Less “fits-the-pattern” episodes (e.g., “special” episodes, double episodes, etc.) are problematic no matter how named. You can decide to rename parsing exceptions or not.

  • -v/--verbose: shows how every target is parsed; by default, only likely errors are shown.

subshop imdb {targets} # verify/update IMDB info for videos

Views, sets, and corrects the IMDB information for the TV show or movie. For downloading the correct subtitles automatically, having a correct IMDB association reduces error considerably.

  • -i/--interactive: shows the IMDB information and gives you opportunity to update it.
  • -n/--dryrun: use to see whether the IMDB is cached or not.

To generate a list of TV shows / movies w/o cached IMDB info, run:

  • subshop -n imdb | grep -B1 create

Here is an example of setting IMDB info:

    $ subshop -i imdb spirited away

    => Spirited Away 2001 720p BluRay x264-REKD.mkv IN /heap/Videos/Movies/Movies=Old
    2021-09-17:17:13:40.452 ERR  imdb-api query failed: status=404 reason=Not Found err=
         url=https://imdb-api.com/en/API//SearchMovie/k_eu9efx7m/Spirited%20Away%202001 [OmdbTool.py:337]

    >>> OMDb Search Results for "Spirited Away" (2001):
     NO matches
    [0] Cancel search

    >> Enter (0-0) [add "p" for poster] -OR-
           ?: