by eikek

on 2022-05-16

Audio file support

Since version 0.36.0, Docspell can be extended by addons: external programs that are executed at defined points in Docspell. This is a walk-through of the first addon that was created, mainly as an example: providing support for audio files.

I think it is interesting to provide support for audio files in a DMS, although admittedly I don't have much use for it myself :). But this is the kind of use case that addons are for.

The idea

The idea is very simple: the real work is done by external programs, most notably coqui's stt, a deep learning toolkit originally created at Mozilla. It provides a command line tool that accepts a WAV file and spits out text. Perfect!

With this text, a PDF file and a preview image can be created, which is already enough for basic support. You can view the PDF in the web UI and search for the text via SOLR or PostgreSQL.

Because WAV is not the most popular format today, ffmpeg can be used to convert any other audio format to WAV.

The only thing left is to create a program that checks the uploaded files, picks out all audio files and runs them through the mentioned programs. So let's do this.

Preparation

Addons are external programs and can be written in any language. For me, this is a good opportunity to refresh my rusty Scheme know-how a bit, so this addon is written in Scheme, in particular Guile. Programming in Scheme is fun, and Guile provides good integration with the (POSIX) OS and also has a nice JSON module. I had the reference docs open all the time; look at them for further details on the functions used.

It's usually good to play around with the tools first. For stt, we first need to download a model. This will be used to "detect" the text in the audio data. The coqui project has a page where we can download model files for any supported language. The addon will support English and German.

When creating a PDF with wkhtmltopdf, we prettify it a little by embedding the plain text into an HTML template. The template also specifies UTF-8 as the default encoding.

FFmpeg just works as usual. It figures out the input format automatically and knows from the extension of the output file what to do.

You can find the full code here. The following shows excerpts from it with some explanation.

The script

Helpers

After the preamble, there are two helpers: a function and a macro.

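;; Print a formatted line to stderr. The trailing newline must go to
;; stderr as well, because stdout is reserved for the JSON result.
(define* (errln formatstr . args)
  (apply format (current-error-port) formatstr args)
  (newline (current-error-port)))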

;; Macro for executing system commands and making this program exit in
;; case of failure.
(define-syntax sysexec
  (syntax-rules ()
    ((sysexec exp ...)
     (let ((rc (apply system* (list exp ...))))
       (unless (eqv? rc EXIT_SUCCESS)
         (format (current-error-port) "> '~a …' failed with: ~#*~:*~d~%" exp ... rc)
         (exit 1))
       #t))))

As this addon passes data back to Docspell via stdout, we use stderr for logging and printing general information. The function errln (short for "error line" :)) makes it convenient to print to stderr, and the sysexec macro wraps the system* procedure such that the script exits whenever the external program fails. It is somewhat similar to set -e in bash.
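To see what the check inside sysexec reacts to, here is a minimal sketch. It assumes a POSIX system where the true and false commands are on the PATH. The helper name exit-code is made up for this illustration and is not part of the addon; system* and status:exit-val are standard Guile procedures.

```scheme
;; Sketch: inspect the status value returned by system*.
;; `exit-code` is a hypothetical helper, not part of the addon.
(define (exit-code . cmd)
  "Run CMD without a shell and return its exit code."
  (status:exit-val (apply system* cmd)))

;; Log to stderr, as the addon does, so stdout stays clean.
(format (current-error-port) "true  -> ~a~%" (exit-code "true"))
(format (current-error-port) "false -> ~a~%" (exit-code "false"))
```

The first call reports zero and the second a non-zero code. A non-zero result is the case in which sysexec prints its message and exits.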

Dependencies

Next is the declaration of external dependencies. First, all external programs are listed in one place. This matters later, when the script is packaged via nix: nix will substitute these commands with absolute paths, so it's good not to have them scattered around.

The script also reads the environment variables provided by Docspell (only those we need). Since this addon only makes sense when working on an item, it quits early if ITEM_DATA_JSON is not set.

(define *curl* "curl")
(define *ffmpeg* "ffmpeg")
(define *stt* "stt")
(define *wkhtmltopdf* "wkhtmltopdf")

;; Getting some environment variables
(define *output-dir* (getenv "OUTPUT_DIR"))
(define *tmp-dir* (getenv "TMP_DIR"))
(define *cache-dir* (getenv "CACHE_DIR"))

(define *item-data-json* (getenv "ITEM_DATA_JSON"))
(define *original-files-json* (getenv "ITEM_ORIGINAL_JSON"))
(define *original-files-dir* (getenv "ITEM_ORIGINAL_DIR"))

;; fail early if not in the right context
(when (not *item-data-json*)
  (errln "No item data json file found.")
  (exit 1))
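The same fail-early idea can be applied to every variable, not only ITEM_DATA_JSON. The following is a minimal sketch of that variation. The helper require-env is hypothetical and not part of the addon, and the setenv call exists only to make the demo succeed outside of Docspell.

```scheme
;; Sketch: read a required environment variable or stop with a message.
;; `require-env` is a hypothetical helper, not part of the addon.
(define (require-env name)
  "Return the value of the environment variable NAME, or exit with status 1."
  (let ((value (getenv name)))
    (unless value
      (format (current-error-port) "Missing environment variable: ~a~%" name)
      (exit 1))
    value))

;; Demo only: set the variable ourselves so the lookup succeeds.
(setenv "OUTPUT_DIR" "/tmp/addon-output")
(define *output-dir* (require-env "OUTPUT_DIR"))
```

Whether every variable should be mandatory is a design choice. The addon itself only insists on the item data.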

Input/Output

The input and output schemas can be defined now. This uses the guile-json module. It provides very convenient features for reading and writing json.

It is possible to define a record via define-json-type, which generates readers and writers from and to JSON. For example, the record <itemdata> is defined as an object with only one field, id. The function json->scm reads JSON into Scheme data structures, and the generated function scm->itemdata then creates the record from it. For every record, accessor functions exist. For example, (itemdata-id data) looks up the field id in the given itemdata record data.

Here we need it to get the item-id and the list of file properties belonging to the original uploaded files.

Another interesting definition is the <output> record. This captures (a subset of) the schema of what Docspell receives from this addon as a result. A full example of this data is here. We don't need commands or newItems, so this schema only cares about the files attribute.

(define-json-type <itemdata>
  (id))

;; The array of original files
(define-json-type <original-file>
  (id)
  (name)
  (position)
  (language)
  (mimetype)
  (length)
  (checksum))

;; The output record, what is returned to docspell
(define-json-type <itemfiles>
  (itemId)
  (textFiles)
  (pdfFiles))
(define-json-type <output>
  (files "files" #(<itemfiles>)))

;; Parses the JSON containing the item information
(define *itemdata-json*
  (scm->itemdata (call-with-input-file *item-data-json* json->scm)))

;; The JSON file containing meta data for all source files as vector.
(define *original-meta-json*
  (let ((props (vector->list (call-with-input-file *original-files-json* json->scm))))
    (map scm->original-file props)))
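To try the generated functions in isolation, a sketch like the following can be run in a Guile REPL. It assumes a guile-json version that provides define-json-type is installed. The JSON text is made up and only stands in for the file named by ITEM_DATA_JSON.

```scheme
;; Sketch, assuming guile-json (with define-json-type) is installed.
(use-modules (json))

(define-json-type <itemdata>
  (id))

;; Made-up JSON text standing in for the ITEM_DATA_JSON file.
(define sample "{\"id\": \"qZDnyGIAJsXr\", \"name\": \"memo.mp3\"}")

;; Same two steps as in the addon: JSON -> Scheme data -> record.
(define data (scm->itemdata (json-string->scm sample)))

(display (itemdata-id data))   ; the value of the "id" field
(newline)
```

The record only declares id, so the name field of the sample plays no role in the result. This mirrors how the addon reads just the item id from a larger document.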

Finding the audio file

The previously parsed json array *original-meta-json* can now be used to find any audio files within the original uploaded files, as done in find-audio-files. It simply goes through the list and keeps those files whose mimetype starts with audio/. The mimetype is provided by Docspell in the file properties in ITEM_ORIGINAL_JSON.

Before converting to WAV with ffmpeg, a quick check makes sure the file is not a WAV already.

(define (is-wav? mime)
  "Test whether the mimetype MIME is denoting a wav file."
  (or (string-suffix? "/wav" mime)
      (string-suffix? "/x-wav" mime)
      (string-suffix? "/vnd.wav" mime)))

(define (find-audio-files)
  "Find all source files that are audio files."
  ;; filter (instead of filter!) leaves the global list untouched
  (filter (lambda (el)
            (string-prefix?
             "audio/"
             (original-file-mimetype el)))
          *original-meta-json*))

(define (convert-wav id mime)
  "Run ffmpeg to convert to wav."
  (let ((src-file (format #f "~a/~a" *original-files-dir* id))
        (out-file (format #f "~a/in.wav" *tmp-dir*)))
    (if (is-wav? mime)
        src-file
        (begin
          (errln "Running ffmpeg to convert to wav file...")
          (sysexec *ffmpeg* "-loglevel" "error" "-y" "-i" src-file out-file)
          out-file))))
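The selection logic can be tried without Docspell or guile-json. This sketch uses made-up file entries as plain association lists in place of the <original-file> records, and the predicate audio? is a name invented for the example. is-wav? is copied from the code above.

```scheme
;; Sketch with made-up file entries as plain association lists.
(define sample-files
  '((("id" . "a1") ("mimetype" . "audio/mpeg"))
    (("id" . "b2") ("mimetype" . "application/pdf"))
    (("id" . "c3") ("mimetype" . "audio/x-wav"))))

;; Hypothetical predicate mirroring the lambda in find-audio-files.
(define (audio? file)
  "True if the mimetype of FILE starts with audio/."
  (string-prefix? "audio/" (assoc-ref file "mimetype")))

(define (is-wav? mime)
  (or (string-suffix? "/wav" mime)
      (string-suffix? "/x-wav" mime)
      (string-suffix? "/vnd.wav" mime)))

;; Report for every audio file whether a conversion would be needed.
(for-each
 (lambda (file)
   (format #t "~a needs ffmpeg: ~a~%"
           (assoc-ref file "id")
           (not (is-wav? (assoc-ref file "mimetype")))))
 (filter audio? sample-files))
```

The PDF entry is dropped by the filter, the MP3 entry is marked for conversion, and the WAV entry is passed through as it is.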

Speech to text

Once we have a WAV file, we can run speech-to-text recognition on it. As said above, we need to download a model first, and the model depends on the language. Luckily, Docspell provides the language of the file. This is either the language given directly by the user when uploading or the collective's default language.

In the following snippet, the functions receive the language as an argument. We will get it from the file properties later.

As seen below, the model file is stored in the CACHE_DIR. This directory is provided by Docspell and will survive the execution of this script. All other directories involved will be deleted eventually. The CACHE_DIR is the place to store intermediate results you don't want to lose between addon runs. But as with any cache, it may not exist the next time the addon is run. Docspell doesn't clear it automatically, though.

The last function simply executes the stt external command and puts stdout into a file.

(define (get-model language)
  (let* ((lang (or language "eng"))
         (file (format #f "~a/model_~a.pbmm" *cache-dir* lang)))
    (unless (file-exists? file)
      (download-model lang file))
    file))

(define (download-model lang file)
  "Download model files per language. Nix has currently stt 0.9.3 packaged."
  (let ((url (cond
              ((string= lang "eng") "https://coqui.gateway.scarf.sh/english/coqui/v0.9.3/model.pbmm")
              ((string= lang "deu") "https://coqui.gateway.scarf.sh/german/AASHISHAG/v0.9.0/model.pbmm")
              (else (error "Unsupported language: " lang)))))
    (errln "Downloading model file for language: ~a" lang)
    (sysexec *curl* "-SsL" "-o" file url)
    file))

(define (extract-text model input out)
  "Runs stt for speech-to-text and writes the text into the file OUT."
  (errln "Extracting text from audio…")
  (with-output-to-file out
    (lambda ()
      (sysexec  *stt* "--model" model "--audio" input))))
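If more languages are added, the cond in download-model grows with each one. One possible alternative, sketched here and not what the addon does, keeps the mapping as data. The names *model-urls*, model-url and supported-language? are invented for this example; the two URLs are the ones from download-model.

```scheme
;; Sketch: language-to-URL lookup as data instead of a cond.
;; The two URLs are the ones used by download-model above.
(define *model-urls*
  '(("eng" . "https://coqui.gateway.scarf.sh/english/coqui/v0.9.3/model.pbmm")
    ("deu" . "https://coqui.gateway.scarf.sh/german/AASHISHAG/v0.9.0/model.pbmm")))

(define (model-url lang)
  "Return the model URL for LANG, or raise an error if it is unsupported."
  (or (assoc-ref *model-urls* lang)
      (error "Unsupported language:" lang)))

(define (supported-language? lang)
  "True if a model URL is known for LANG."
  (and (assoc-ref *model-urls* lang) #t))
```

With such a table, a file in an unsupported language could be skipped up front via supported-language? instead of failing inside the download step.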

Create PDF

Creating the PDF is straightforward. The extracted text is embedded into an HTML file, which is then passed to wkhtmltopdf. Since we don't need this file for anything else, it is stored in the TMP_DIR.

(define (create-pdf txt-file out)
  (define (line str)
    (format #t "~a\n" str))
  (errln "Creating pdf file…")
  (let ((tmphtml (format #f "~a/text.html" *tmp-dir*)))
    (with-output-to-file tmphtml
      (lambda ()
        (line "<!DOCTYPE html>")
        (line "<html>")
        (line "  <head><meta charset=\"UTF-8\"></head>")
        (line "  <body style=\"padding: 2em; font-size: large;\">")
        (line " <div style=\"padding: 0.5em; font-size:normal; font-weight: bold; border: 1px solid black;\">")
        (line "  Extracted from audio using stt on ")
        (display (strftime "%c" (localtime (current-time))))
        (line " </div>")
        (line " <p>")
        (display (call-with-input-file txt-file read-string))
        (line " </p>")
        (line "</body></html>")))
    (sysexec *wkhtmltopdf* tmphtml out)))
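The text is written into the HTML as it is. Should the recognised text ever contain characters such as < or &, they would be read as markup. A small escaping step is one way to guard against that; the following is only a sketch, and html-escape is a hypothetical helper that is not part of the addon.

```scheme
;; Sketch: escape the characters that have a special meaning in HTML.
;; `html-escape` is a hypothetical helper, not part of the addon.
(define (html-escape str)
  "Return STR with the characters &, < and > replaced by HTML entities."
  (string-concatenate
   (map (lambda (ch)
          (case ch
            ((#\&) "&amp;")
            ((#\<) "&lt;")
            ((#\>) "&gt;")
            (else (string ch))))
        (string->list str))))

(display (html-escape "a < b & c"))   ; prints: a &lt; b &amp; c
(newline)
```

In create-pdf, such a helper would wrap the string that is read from txt-file before it is displayed.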

Putting it together

The main function now puts everything together. The process-file function is called for every file that is returned from (find-audio-files). It extracts the necessary information (like the language) from the JSON document via record accessors (e.g. (original-file-language file)) and then calls the functions defined above. Finally, it creates an <itemfiles> record with make-itemfiles.

An <itemfiles> record now contains the important information for Docspell. It requires the item-id and a mapping from attachment-ids to files in OUTPUT_DIR. For each attachment identified by its ID, Docspell replaces the extracted text with the contents of the given text file and replaces the converted PDF with the given PDF file. In the code below, two lists of such mappings are defined: the first for the text files, the second for the converted PDFs. The files must be specified relative to OUTPUT_DIR.

That means process-all returns a list of <itemfiles> records, which is then used to create the <output> record. Finally, the output->json function turns the record into proper JSON, which is sent to stdout.

(define (process-file itemid file)
  "Processing a single audio file."
  (let* ((id (original-file-id file))
         (mime (original-file-mimetype file))
         (lang (original-file-language file))
         (txt-file (format #f "~a/~a.txt" *output-dir* id))
         (pdf-file (format #f "~a/~a.pdf" *output-dir* id))
         (wav (convert-wav id mime))
         (model (get-model lang)))
    (extract-text model wav txt-file)
    (create-pdf txt-file pdf-file)
    (make-itemfiles itemid
                    `((,id . ,(format #f "~a.txt" id)))
                    `((,id . ,(format #f "~a.pdf" id))))))

(define (process-all)
  (let ((item-id (itemdata-id *itemdata-json*)))
    (map (lambda (file)
           (process-file item-id file))
         (find-audio-files))))

(define (main args)
  (let ((out (make-output (process-all))))
    (format #t "~a" (output->json out))))

Example output:

{
  "files": [
    {
      "itemId":"qZDnyGIAJsXr",
      "textFiles": { "HPFvIDib6eA": "HPFvIDib6eA.txt" },
      "pdfFiles":  { "HPFvIDib6eA": "HPFvIDib6eA.pdf"}
    }
  ]
}
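The structure of this result can also be built by hand from plain Scheme data, which makes the mapping between the two visible: association lists become JSON objects and vectors become JSON arrays. This is a sketch that assumes guile-json 4 is installed. The helper file-mapping is invented for the example, and the ids are the ones from the example output.

```scheme
;; Sketch, assuming guile-json 4: alists become objects, vectors arrays.
(use-modules (json))

;; Hypothetical helper building one entry of the "files" array.
(define (file-mapping item-id attach-id)
  "Build one entry of the files array as an association list."
  `(("itemId" . ,item-id)
    ("textFiles" . ((,attach-id . ,(string-append attach-id ".txt"))))
    ("pdfFiles" . ((,attach-id . ,(string-append attach-id ".pdf"))))))

(define result
  `(("files" . ,(vector (file-mapping "qZDnyGIAJsXr" "HPFvIDib6eA")))))

;; Write the JSON text to stdout, where Docspell picks it up.
(display (scm->json-string result))
(newline)
```

The records generated by define-json-type save this manual work, but the plain form is handy for experimenting in the REPL.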

Packaging

With the script done, some additional plumbing is needed to make it an "Addon" for Docspell.

The external tools (stt, ffmpeg, curl and wkhtmltopdf) are required, as well as guile to compile and interpret the script. The guile-json module must be installed, too.

This can turn into quite a tedious task. Luckily, nix has an answer to this. A user who wants to use this script only needs to install nix. This package manager then takes care of providing the exact dependencies we need (down to the correct version and including guile as the language and runtime).

A flake

Everything is defined in the flake.nix in the source root. It looks like this:

{
  description = "A docspell addon for basic audio file support";

  inputs = {
    utils.url = "github:numtide/flake-utils";

    # Nixpkgs / NixOS version to use.
    nixpkgs.url = "nixpkgs/nixos-21.11";
  };

  outputs = { self, nixpkgs, utils }:
    utils.lib.eachSystem ["x86_64-linux"] (system:
      let
        pkgs = import nixpkgs {
          inherit system;
          overlays = [

          ];
        };
        name = "audio-files-addon";
      in rec {
        packages.${name} = pkgs.callPackage ./nix/addon.nix {
          inherit name;
        };

        defaultPackage = packages.${name};

        apps.${name} = utils.lib.mkApp {
          inherit name;
          drv = packages.${name};
        };
        defaultApp = apps.${name};

        ## … omitted for brevity
      }
    );
}

The first sad thing is that only x86_64 systems are supported. This is because stt, as provided by nixpkgs, is currently not available on other platforms.

The rest is a bit magic: a package and a "defaultPackage" are defined with a reference to nix/addon.nix. The important part is the line

  inputs = {
    # Nixpkgs / NixOS version to use.
    nixpkgs.url = "nixpkgs/nixos-21.11";
  };

It says that as input for "building" the script, we take all of nixpkgs, a package collection defined for (and in) nix that includes thousands of software packages. We can pick and choose from these. No surprise, all external tools we need are included!

A flake defines the inputs and outputs of a package. With all of nixpkgs as inputs, we can create a definition to elevate this script into a package.

Package definition

The definition for "building" the script is in nix/addon.nix:

{ stdenv, bash, cacert, curl, stt, wkhtmltopdf, ffmpeg, guile, guile-json, lib, name }:

stdenv.mkDerivation {
  inherit name;
  src = lib.sources.cleanSource ../.;

  buildInputs = [ guile guile-json ];

  patchPhase = ''
    TARGET=src/addon.scm
    sed -i 's,\*curl\* "curl",\*curl\* "${curl}/bin/curl",g' $TARGET
    sed -i 's,\*ffmpeg\* "ffmpeg",\*ffmpeg\* "${ffmpeg}/bin/ffmpeg",g' $TARGET
    sed -i 's,\*stt\* "stt",\*stt\* "${stt}/bin/stt",g' $TARGET
    sed -i 's,\*wkhtmltopdf\* "wkhtmltopdf",\*wkhtmltopdf\* "${wkhtmltopdf}/bin/wkhtmltopdf",g' $TARGET
  '';

  buildPhase = ''
    guild compile -o ${name}.go src/addon.scm
  '';

  # module name must be same as <filename>.go
  installPhase = ''
    mkdir -p $out/{bin,lib}
    cp ${name}.go $out/lib/

    cat > $out/bin/${name} <<-EOF
    #!${bash}/bin/bash
    export SSL_CERT_FILE="${cacert}/etc/ssl/certs/ca-bundle.crt"
    exec -a "${name}" ${guile}/bin/guile -C ${guile-json}/share/guile/ccache -C $out/lib -e '(${name}) main' -c "" "\$@"
    EOF
    chmod +x $out/bin/${name}
  '';
}

With a bit of handwaving: this is a bash script that slightly modifies the Scheme script and runs a compile on it. We simply declare all packages we need in the first line, inside { … }. These are arguments that nix fills in automatically by looking up the corresponding packages in nixpkgs.

First the patchPhase is executed. It will replace the variables containing the external tools with an absolute path to the version that we currently get from nixpkgs. With this step nix takes care that all these packages are available at runtime when executing the script. All versions are finally fixed in flake.lock and can be upgraded manually.

The buildPhase runs the guile compiler that produces some intermediate code that will be loaded instead of compiling the script on-the-fly.

Finally, installPhase creates a wrapper script that runs guile with the correct load-path pointing to guile-json and to our pre-compiled script. Additionally, trusted root certificates are exported to make the curl commands work. This script is created in the $out directory that is provided by nix.

If you now run nix build in the source root, it will execute all these phases and produce a symlink pointing to the result. You can then cat the resulting file if you are curious.

This way the script is completely isolated from the system it runs on - as long as the nix package manager is available. It includes all the external tools, as well as the underlying runtime (guile)! The result is a tiny wrapper bash script that can be run "everywhere" (modulo all the restrictions, like non-x86_64 platforms, of course :)).

Addon Descriptor

Finally, a small YAML file is needed to tell Docspell a little about the addon.

meta:
  name: "audio-files-addon"
  version: "0.1.0"
  description: |
    This addon adds support for audio files. Audio files are processed
    by a speech-to-text engine and a pdf is generated.

    It doesn't expect any user arguments at the moment. It requires
    internet access to download model files.

triggers:
  - final-process-item
  - final-reprocess-item
  - existing-item

runner:
  nix:
    enable: true

  docker:
    enable: false

  trivial:
    enable: true
    exec: src/addon.scm

options:
  networking: true
  collectOutput: true

This tells Docspell via triggers when this addon may be run. This one only makes sense for an item. Thus it can be hooked up to run with every file-processing job or a user can manually trigger it on an item.

It also tells via runner: that it can be built and run via nix, but not via docker (I gave up after an hour of trying to create a Dockerfile…). It could also be run "as-is", but the user then needs to install all these tools and guile manually.

Done

That's it. You can install this addon in Docspell and create a run configuration to let it execute when you want.
