Configuration

Docspell's executable can take one argument – a configuration file. If that is not given, the defaults are used. The config file overrides default values, so only values that differ from the defaults are necessary.

This applies to both the restserver and joex.

Important Config Options

The configuration of both components uses separate namespaces. The configuration for the REST server is below docspell.server, while the one for joex is below docspell.joex.

You can therefore use two separate config files or one single file containing both namespaces.
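
As a minimal sketch of the single-file variant, the fragment below puts both namespaces into one file that could be passed to either executable. The file name is an assumption; the keys and values are taken from the defaults listed at the end of this page.

```hocon
# docspell.conf (hypothetical file name), given to both executables

# Settings for the REST server
docspell.server {
  app-id = "rest1"
  bind {
    address = "localhost"
    port = 7880
  }
}

# Settings for joex
docspell.joex {
  app-id = "joex1"
  bind {
    address = "localhost"
    port = 7878
  }
}
```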

JDBC

This configures the connection to the database. This has to be specified for the rest server and joex. By default, an H2 database in the system's temporary directory (java.io.tmpdir, typically /tmp) is configured.

The config looks like this (both components):

docspell.joex.jdbc {
  url = ...
  user = ...
  password = ...
}

docspell.server.backend.jdbc {
  url = ...
  user = ...
  password = ...
}

The url is the connection to the database. It must start with jdbc, followed by the name of the database. The rest is specific to the database used: it is either a path to a file for H2 or a host/database url for MariaDB and PostgreSQL.

When using H2, the user and password can be chosen freely on first start, but must stay the same on subsequent starts. Usually, the user is sa and the password is left empty. Additionally, the url must include these options:

;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE

Examples

PostgreSQL:

url = "jdbc:postgresql://localhost:5432/docspelldb"

MariaDB:

url = "jdbc:mariadb://localhost:3306/docspelldb"

H2:

url = "jdbc:h2:///path/to/a/file.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
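
Since joex must use the same connection as the REST server, a single config file can define it once and reuse it through a HOCON substitution. This is only a sketch: `db-shared` is a hypothetical helper key (not a Docspell setting), the user name and password are placeholders, and it assumes both namespaces live in the same file.

```hocon
# Hypothetical helper key that holds the shared connection settings.
db-shared {
  url = "jdbc:postgresql://localhost:5432/docspelldb"
  user = "docspell"        # placeholder user name
  password = "change-me"   # placeholder password
}

# Both components point to the same definition.
docspell.server.backend.jdbc = ${db-shared}
docspell.joex.jdbc = ${db-shared}
```

With two separate config files the substitution cannot be shared, so the three values would have to be repeated in each file.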

Admin Endpoint

The admin endpoint defines some routes for administration tasks. This is disabled by default and can be enabled by providing a secret:

...
  admin-endpoint {
    secret = "123"
  }

This secret must be provided with every request to an /api/v1/admin/ endpoint.

Full-Text Search: SOLR

Apache SOLR is used to provide the full-text search. Both docspell components must provide the same connection setup. This is defined in the full-text-search.solr subsection:

...
  full-text-search {
    enabled = true
    ...
    solr = {
      url = "http://localhost:8983/solr/docspell"
    }
  }

The default configuration at the end of this page contains more information about each setting.

The solr.url is the mandatory setting that you need to change to point to your SOLR instance. Then you need to set the enabled flag to true.

When installing docspell manually, just install solr and create a core as described in the solr documentation. That will provide you with the connection url (the last part is the core name).

The full-text-search.solr options are the same for joex and the restserver.
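
To illustrate the "same connection setup" requirement, the following sketch enables full-text search for both components in one file. It assumes that joex accepts the same `enabled` flag as the REST server and that SOLR runs on the default URL from this page.

```hocon
# REST server side
docspell.server.full-text-search {
  enabled = true
  solr {
    url = "http://localhost:8983/solr/docspell"
  }
}

# joex side: same SOLR core as above
docspell.joex.full-text-search {
  enabled = true
  solr {
    url = "http://localhost:8983/solr/docspell"
  }
}
```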

There is an admin route that allows re-creating the entire index (for all collectives). This is possible via a call:

$ curl -XPOST -H "Docspell-Admin-Secret: test123" http://localhost:7880/api/v1/admin/fts/reIndexAll

Here, test123 is the key defined with admin-endpoint.secret. If it is empty (the default), this call is disabled, as are all admin routes. Otherwise, the POST request will submit a system task that is executed by a joex instance eventually.

Using this endpoint, the index will be re-created. This is sometimes necessary, for example if you upgrade SOLR or delete the core to provide a new one (see here for details). Note that a collective can also re-index their data using a similar endpoint; but this is only deleting their data and doesn't do a full re-index.

The solr index doesn't contain any new information; it can be regenerated any time using the above REST call. Thus it doesn't need to be backed up.

Bind

The host and port the http server binds to. This applies to both components. The joex component also exposes a small REST api to inspect its state and notify the scheduler.

docspell.server.bind {
  address = localhost
  port = 7880
}
docspell.joex.bind {
  address = localhost
  port = 7878
}

By default, each component binds to localhost and a predefined port. This must be changed if the components run on different machines.
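
For components on different machines, a sketch like the following could make joex reachable from other hosts. The address 0.0.0.0 conventionally means "all IPv4 interfaces"; that Docspell passes it unchanged to its HTTP server is an assumption here.

```hocon
docspell.joex.bind {
  # Listen on all IPv4 interfaces instead of localhost only (assumed behaviour).
  address = "0.0.0.0"
  port = 7878
}
```

When binding like this, the base-url of the component most likely has to be set explicitly as well, because it must be resolvable from the REST server.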

Baseurl

The base url is an important setting that defines the http URL where the corresponding component can be reached. It applies to both components. For a joex component, the url must be resolvable from a REST server component. The REST server also uses this url to create absolute urls and to configure the authentication cookie.

By default it is built using the information from the bind setting, which results in http://localhost:7880 for the REST server and http://localhost:7878 for joex.

If the default is not changed, docspell will use the login request to determine the base-url. It first inspects the X-Forwarded-For header that is often used with reverse proxies. If that is not present, the Host header of the request is used. However, if the base-url setting is changed, then only this setting is used.

docspell.server.base-url = ...
docspell.joex.base-url = ...

Examples

docspell.server.base-url = "https://docspell.example.com"
docspell.joex.base-url = "http://192.168.101.10"

App-id

The app-id is the identifier of the corresponding instance. It must be unique for all instances. By default the REST server uses rest1 and joex joex1. It is recommended to overwrite this setting to have an explicit and stable identifier.

docspell.server.app-id = "rest1"
docspell.joex.app-id = "joex1"

Registration Options

This defines if and how new users can create accounts. There are 3 options:

  • closed: no new user can sign up
  • open: new users can sign up
  • invite: new users can sign up but require an invitation key

This applies only to the REST server component.

docspell.server.backend.signup {
  mode = "open"

  # If mode == 'invite', a password must be provided to generate
  # invitation keys. It must not be empty.
  new-invite-password = ""

  # If mode == 'invite', this is the period an invitation token is
  # considered valid.
  invite-time = "3 days"
}

The mode invite is intended to open the application only to some users. The admin can create invitation keys and distribute them to the desired people. For this, the new-invite-password must be given. The idea is that only the person who installs docspell knows this password. If it is not set, invitations won't work. New invitation keys can be generated from within the web application or via REST calls (using curl, for example).

curl -X POST -d '{"password":"blabla"}' "http://localhost:7880/api/v1/open/signup/newinvite"
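
A configuration for invitation-only sign-up might look like the sketch below. The password is a placeholder; whatever value is set here is the one to send in the `password` field of the request above.

```hocon
docspell.server.backend.signup {
  mode = "invite"

  # Required for generating invitation keys; must not be empty.
  new-invite-password = "change-me"   # placeholder value

  # Period an invitation key stays valid.
  invite-time = "3 days"
}
```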

Authentication

Authentication works in two ways:

  • with an account-name / password pair
  • with an authentication token

The initial authentication must occur with an account-name/password pair. This will generate an authentication token which is valid for some time. Subsequent calls to secured routes can use this token. The token can be given as a normal http header or via a cookie header.

These settings apply only to the REST server.

docspell.server.auth {
  server-secret = "hex:caffee" # or "b64:Y2FmZmVlCg=="
  session-valid = "5 minutes"
}

The server-secret is used to sign the token. If multiple REST servers are deployed, all must share the same server secret. Otherwise tokens from one instance are not valid on another instance. The secret can be given as a Base64 encoded string or in hex form. Use the prefixes b64: and hex:, respectively. If no prefix is given, the UTF8 bytes of the string are used.

The session-valid setting determines how long a token is valid. This can be just a few minutes, because the web application obtains new tokens periodically. So a short time is recommended.

File Processing

Files are processed by the joex component, so all the respective configuration is in the joex config only.

File processing involves several stages. Detailed information can be found here and in the corresponding sections in joex default config.

The configuration lets you define the external tools and set some limits to control memory usage. The sections are:

  • docspell.joex.extraction
  • docspell.joex.text-analysis
  • docspell.joex.convert

Options to external commands can use variables that are replaced by values at runtime. Variables are enclosed in double braces {{…}}. Please see the default configuration for what variables exist per command.
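
As an illustration of these variables, the sketch below overrides the tesseract command with an absolute path while keeping the `{{file}}` and `{{lang}}` placeholders from the default configuration. The install location is an assumption.

```hocon
docspell.joex.extraction.ocr.tesseract.command {
  # Absolute path instead of relying on PATH (assumed location).
  program = "/usr/local/bin/tesseract"

  # {{file}} and {{lang}} are replaced at runtime.
  args = [ "{{file}}", "stdout", "-l", "{{lang}}" ]

  timeout = "5 minutes"
}
```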

Classification

In text-analysis.classification you can define how many documents at most should be used for learning. The default settings should work well for most cases. However, it always depends on the amount of data and the machine that runs joex. For example, by default the documents to learn from are limited to 600 (classification.item-count) and every text is cut after 5000 characters (text-analysis.max-length). This is fine if most of your documents are small and only a few are near 5000 characters. But if all your documents are very large, you probably need to either assign more heap memory or lower the limits.

Classification can be disabled, too, for when it's not needed.
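
For a machine with little memory, lowering both limits could look like the following sketch. The numbers are illustrative only, not recommendations from the Docspell project.

```hocon
docspell.joex.text-analysis {
  # Cut every text after this many characters (default: 5000).
  max-length = 3000

  classification {
    # Set to false to turn classification off entirely.
    enabled = true

    # Learn from at most this many documents (default: 600).
    item-count = 300
  }
}
```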

NLP

This setting defines which NLP mode to use. It defaults to full, which requires more memory for certain languages (with the advantage of better results). Other values are basic, regexonly and disabled. The modes full and basic use pre-defined language models for processing documents in German, English and French. These require some amount of memory (see below).

The mode basic is like a "light" variant of full. It doesn't use all NLP features, which makes memory consumption much lower, but comes with the compromise of less accurate results.

The mode regexonly doesn't use pre-defined language models, even if available. It checks your address book against a document to find metadata, which makes it language independent. Also, when using full or basic with languages for which no pre-defined models exist, it will degrade to regexonly for these.

The mode disabled skips NLP processing completely. This has the least impact on memory consumption, but then only the classifier is used to find metadata (unless it is disabled, too).

You might want to try different modes and see which combination best suits your usage pattern and the machine running joex. If a powerful machine is used, simply leave the defaults. When running on a Raspberry Pi, for example, you might need to adjust things.
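
A sketch for a small machine, switching to the lighter mode. Which mode fits best depends on your documents and hardware, so treat the choice here as an example only.

```hocon
docspell.joex.text-analysis.nlp {
  # One of: full, basic, regexonly, disabled
  mode = basic

  # Clear cached language models after this idle time (the default value).
  clear-interval = "15 minutes"
}
```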

Memory Usage

The memory requirements for the joex component depend on the document language and the enabled features for text-analysis. The nlp.mode setting has significant impact, especially when your documents are in German. Here are some rough numbers on jvm heap usage (the same file was used for all tries):

nlp.mode   English   German   French
full       420M      950M     490M
basic      170M      380M     390M

Note that these are only rough numbers and they show the maximum used heap memory while processing a file.

When using mode=full, a heap setting of at least -Xmx1400M is recommended. For mode=basic a heap setting of at least -Xmx500M is recommended.

Other languages can't use these two modes, and so don't require this amount of memory (but don't have as good results). Then you can go with less heap. For these languages, the nlp mode is the same as regexonly.

Training the classifier is also memory intensive, which solely depends on the size and number of documents that are being trained. However, training the classifier is done periodically and can happen maybe every two weeks. When classifying new documents, memory requirements are lower, since the model already exists.

More details about these modes can be found here.

The restserver component is very lightweight; here you can use the defaults.

File Format

The format of the configuration files can be HOCON, JSON or whatever the used config library understands. The default values below are in HOCON format, which is recommended, since it allows comments and has some advanced features. Please refer to their documentation for more on this.

A short description (please see the links for better understanding): The config consists of key-value pairs and can be written in a JSON-like format (called HOCON). Keys are organized in trees, and a key defines a full path into the tree. There are two ways:

a.b.c.d=15

or

a {
  b {
    c {
      d = 15
    }
  }
}

Both are exactly equivalent, and both forms can be mixed in the same file. Usually the braces approach is used to group several settings for better readability.
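
The two notations can be combined freely. The sketch below, using keys from the joex defaults, sets one value with a dotted path and groups others with braces.

```hocon
docspell.joex {
  # Dotted path inside a braces block: same as bind { port = 7878 }
  bind.port = 7878

  # Braces to group several related settings
  scheduler {
    pool-size = 1
    retries = 2
  }
}
```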

Default Config

Rest Server

docspell.server {
  # This is shown in the top right corner of the web application
  app-name = "Docspell"
  # This is the id of this node. If you run more than one server, you
  # have to make sure to provide unique ids per node.
  app-id = "rest1"
  # This is the base URL this application is deployed to. This is used
  # to create absolute URLs and to configure the cookie.
  #
  # If default is not changed, the HOST line of the login request is
  # used instead or the value of the `X-Forwarded-For` header. If set
  # to some other value, the request is not inspected.
  base-url = "http://localhost:7880"
  # Where the server binds to.
  bind {
    address = "localhost"
    port = 7880
  }
  # This is a hard limit to restrict the size of a batch that is
  # returned when searching for items. The user can set this limit
  # within the client config, but it is restricted by the server to
  # the number defined here. An admin might choose a lower number
  # depending on the available resources.
  max-item-page-size = 200
  # The number of characters to return for each item notes when
  # searching. Item notes may be very long, when returning them with
  # all the results from a search, they add quite some data to return.
  # In order to keep this low, a limit can be defined here.
  max-note-length = 180
  # This defines whether the classification form in the collective
  # settings is displayed or not. If all joex instances have document
  # classification disabled, it makes sense to hide its settings from
  # users.
  show-classification-settings = true
  # Authentication.
  auth {
    # The secret for this server that is used to sign the authenticator
    # tokens. If multiple servers are running, all must share the same
    # secret. You can use base64 or hex strings (prefix with b64: and
    # hex:, respectively).
    server-secret = "hex:caffee"
    # How long an authentication token is valid. The web application
    # will get a new one periodically.
    session-valid = "5 minutes"
    remember-me {
      enabled = true
      # How long the remember me cookie/token is valid.
      valid = "30 days"
    }
  }
  # This endpoint allows to upload files to any collective. The
  # intention is that local software integrates with docspell more
  # easily. Therefore the endpoint is not protected by the usual
  # means.
  #
  # For security reasons, this endpoint is disabled by default. If
  # enabled, you can choose from some ways to protect it. It may be a
  # good idea to further protect this endpoint using a firewall, such
  # that outside traffic is not routed.
  #
  # NOTE: If all protection methods are disabled, the endpoint is not
  # protected at all!
  integration-endpoint {
    enabled = false
    # The priority to use when submitting files through this endpoint.
    priority = "low"
    # The name used for the item "source" property when uploaded
    # through this endpoint.
    source-name = "integration"
    # IPv4 addresses to allow access. An empty list, if enabled,
    # prohibits all requests. IP addresses may be specified as simple
    # globs: a part marked as `*' matches any octet, like in
    # `192.168.*.*`. The `127.0.0.1' (the default) matches the
    # loopback address.
    allowed-ips {
      enabled = false
      ips = [ "127.0.0.1" ]
    }
    # Requests are expected to use http basic auth when uploading
    # files.
    http-basic {
      enabled = false
      realm = "Docspell Integration"
      user = "docspell-int"
      password = "docspell-int"
    }
    # Requests are expected to supply some specific header when
    # uploading files.
    http-header {
      enabled = false
      header-name = "Docspell-Integration"
      header-value = "some-secret"
    }
  }
  # This is a special endpoint that allows some basic administration.
  #
  # It is intended to be used by admins only, that is users who
  # installed the app and have access to the system. Normal users
  # should not have access and therefore a secret must be provided in
  # order to access it.
  #
  # This is used for some endpoints, for example:
  # - re-create complete fulltext index:
  #   curl -XPOST -H'Docspell-Admin-Secret: xyz' http://localhost:7880/api/v1/admin/fts/reIndexAll
  admin-endpoint {
    # The secret. If empty, the endpoint is disabled.
    secret = ""
  }
  # Configuration of the full-text search engine.
  full-text-search {
    # The full-text search feature can be disabled. It requires an
    # additional index server which needs additional memory and disk
    # space. It can be enabled later any time.
    #
    # Currently the SOLR search platform is supported.
    enabled = false
    # Configuration for the SOLR backend.
    solr = {
      # The URL to solr
      url = "http://localhost:8983/solr/docspell"
      # Used to tell solr when to commit the data
      commit-within = 1000
      # If true, logs request and response bodies
      log-verbose = false
      # The defType parameter to lucene that defines the parser to
      # use. You might want to try "edismax" or look here:
      # https://lucene.apache.org/solr/guide/8_4/query-syntax-and-parsing.html#query-syntax-and-parsing
      def-type = "lucene"
      # The default combiner for tokens. One of {AND, OR}.
      q-op = "OR"
    }
  }
  # Configuration for the backend.
  backend {
    # Enable or disable debugging for e-mail related functionality. This
    # applies to both sending and receiving mails. For security reasons
    # logging is not very extensive on authentication failures. Setting
    # this to true, results in a lot of data printed to stdout.
    mail-debug = false
    # The database connection.
    #
    # By default a H2 file-based database is configured. You can
    # provide a postgresql or mariadb connection here. When using H2
    # use the PostgreSQL compatibility mode and AUTO_SERVER feature.
    jdbc {
      url = "jdbc:h2://"${java.io.tmpdir}"/docspell-demo.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
      user = "sa"
      password = ""
    }
    # Configuration for registering new users.
    signup {
      # The mode defines if new users can signup or not. It can have
      # three values:
      #
      # - open: every new user can sign up
      # - invite: new users can sign up only if they provide a correct
      #   invitation key. Invitation keys can be generated by the
      #   server.
      # - closed: signing up is disabled.
      mode = "open"
      # If mode == 'invite', a password must be provided to generate
      # invitation keys. It must not be empty.
      new-invite-password = ""
      # If mode == 'invite', this is the period an invitation token is
      # considered valid.
      invite-time = "3 days"
    }
    files {
      # Defines the chunk size (in bytes) used to store the files.
      # This will affect the memory footprint when uploading and
      # downloading files. At most this amount is loaded into RAM for
      # down- and uploading.
      #
      # It also defines the chunk size used for the blobs inside the
      # database.
      chunk-size = 524288
      # The file content types that are considered valid. Docspell
      # will only pass these files to processing. The processing code
      # itself has also checks for which files are supported and which
      # not. This affects the uploading part and can be used to
      # restrict file types that should be handed over to processing.
      # By default all files are allowed.
      valid-mime-types = [ ]
    }
  }
}

Joex

docspell.joex {
  # This is the id of this node. If you run more than one server, you
  # have to make sure to provide unique ids per node.
  app-id = "joex1"
  # This is the base URL this application is deployed to. This is used
  # to register this joex instance such that docspell rest servers can
  # reach them
  base-url = "http://localhost:7878"
  # Where the REST server binds to.
  #
  # JOEX provides a very simple REST interface to inspect its state.
  bind {
    address = "localhost"
    port = 7878
  }
  # The database connection.
  #
  # By default a H2 file-based database is configured. You can provide
  # a postgresql or mariadb connection here. When using H2 use the
  # PostgreSQL compatibility mode and AUTO_SERVER feature.
  #
  # It must be the same connection as the rest server is using.
  jdbc {
    url = "jdbc:h2://"${java.io.tmpdir}"/docspell-demo.db;MODE=PostgreSQL;DATABASE_TO_LOWER=TRUE;AUTO_SERVER=TRUE"
    user = "sa"
    password = ""
  }
  # Enable or disable debugging for e-mail related functionality. This
  # applies to both sending and receiving mails. For security reasons
  # logging is not very extensive on authentication failures. Setting
  # this to true, results in a lot of data printed to stdout.
  mail-debug = false
  send-mail {
    # This is used as the List-Id e-mail header when mails are sent
    # from docspell to its users (example: for notification mails). It
    # is not used when sending to external recipients. If it is empty,
    # no such header is added. Using this header is often useful when
    # filtering mails.
    #
    # It should be a string in angle brackets. See
    # https://tools.ietf.org/html/rfc2919 for a formal specification
    # of this header.
    list-id = ""
  }
  # Configuration for the job scheduler.
  scheduler {
    # Each scheduler needs a unique name. This defaults to the node
    # name, which must be unique, too.
    name = ${docspell.joex.app-id}
    # Number of processing allowed in parallel.
    pool-size = 1
    # A counting scheme determines the ratio of how high- and low-prio
    # jobs are run. For example: 4,1 means run 4 high prio jobs, then
    # 1 low prio and then start over.
    counting-scheme = "4,1"
    # How often a failed job should be retried until it enters failed
    # state. If a job fails, it becomes "stuck" and will be retried
    # after a delay.
    retries = 2
    # The delay until the next try is performed for a failed job. This
    # delay is increased exponentially with the number of retries.
    retry-delay = "1 minute"
    # The queue size of log statements from a job.
    log-buffer-size = 500
    # If no job is left in the queue, the scheduler will wait until a
    # notify is requested (using the REST interface). To also retry
    # stuck jobs, it will notify itself periodically.
    wakeup-period = "30 minutes"
  }
  periodic-scheduler {
    # Each scheduler needs a unique name. This defaults to the node
    # name, which must be unique, too.
    name = ${docspell.joex.app-id}
    # A fallback to start looking for due periodic tasks regularly.
    # Usually joex instances should be notified via REST calls if
    # external processes change tasks. But these requests may get
    # lost.
    wakeup-period = "10 minutes"
  }
  # Configuration for the user-tasks.
  user-tasks {
    # Allows to import e-mails by scanning a mailbox.
    scan-mailbox {
      # A limit of how many folders to scan through. If a user
      # configures more than this, only up to this limit folders are
      # scanned and a warning is logged.
      max-folders = 50
      # How many mails (headers only) to retrieve in one chunk.
      #
      # If this is greater than `max-mails' it is set automatically to
      # the value of `max-mails'.
      mail-chunk-size = 50
      # A limit on how many mails to process in one job run. This is
      # meant to avoid too heavy resource allocation to one
      # user/collective.
      #
      # If more than this number of mails is encountered, a warning is
      # logged.
      max-mails = 500
    }
  }
  # Docspell uses periodic house keeping tasks, like cleaning expired
  # invites, that can be configured here.
  house-keeping {
    # When the house keeping tasks execute. Default is to run every
    # week.
    schedule = "Sun *-*-* 00:00:00"
    # This task removes invitation keys that have been created but not
    # used. The timespan here must be greater than the `invite-time'
    # setting in the rest server config file.
    cleanup-invites = {
      # Whether this task is enabled.
      enabled = true
      # The minimum age of invites to be deleted.
      older-than = "30 days"
    }
    # This task removes expired remember-me tokens. The timespan
    # should be greater than the `valid` time in the restserver
    # config.
    cleanup-remember-me = {
      # Whether the job is enabled.
      enabled = true
      # The minimum age of tokens to be deleted.
      older-than = "30 days"
    }
    # Jobs store their log output in the database. Normally this data
    # is only interesting for some period of time. The processing logs
    # of old files can be removed eventually.
    cleanup-jobs = {
      # Whether this task is enabled.
      enabled = true
      # The minimum age of jobs to delete. It is matched against the
      # `finished' timestamp.
      older-than = "30 days"
      # This defines how many jobs are deleted in one transaction.
      # Since the data to delete may get large, it can be configured
      # whether more or less memory should be used.
      delete-batch = "100"
    }
    # Removes node entries that are not reachable anymore.
    check-nodes {
      # Whether this task is enabled
      enabled = true
      # How often the node must be unreachable, before it is removed.
      min-not-found = 2
    }
  }
  # Configuration of text extraction
  extraction {
    # For PDF files it is first tried to read the text parts of the
    # PDF. But PDFs can be complex documents and they may contain text
    # and images. If the returned text is shorter than the value
    # below, OCR is run afterwards. Then both extracted texts are
    # compared and the longer will be used.
    pdf {
      min-text-len = 500
    }
    preview {
      # When rendering a pdf page, use this dpi. This results in
      # scaling the image. A standard A4 page rendered at 96dpi
      # results in roughly 790x1100px image. Using 32 results in
      # roughly 200x300px image.
      #
      # Note, when this is changed, you might want to re-generate
      # preview images. Check the api for this, there is an endpoint
      # to regenerate all for a collective.
      dpi = 32
    }
    # Extracting text using OCR works for image and pdf files. It will
    # first run ghostscript to create a gray image from a pdf. Then
    # unpaper is run to optimize the image for the upcoming ocr, which
    # will be done by tesseract. All these programs must be available
    # in your PATH or the absolute path can be specified below.
    ocr {
      # Images greater than this size are skipped. Note that every
      # image is loaded completely into memory for doing OCR. This is
      # the pixel count, `height * width` of the image.
      max-image-size = 14000000
      # Defines what pages to process. If a PDF with 600 pages is
      # submitted, it is probably not necessary to scan through all of
      # them. This would take a long time and occupy resources for no
      # value. The first few pages should suffice. The default is first
      # 10 pages.
      #
      # If you want all pages being processed, set this number to -1.
      #
      # Note: if you change the ghostscript command below, be aware that
      # this setting (if not -1) will add another parameter to the
      # beginning of the command.
      page-range {
        begin = 10
      }
      # The ghostscript command.
      ghostscript {
        command {
          program = "gs"
          args = [ "-dNOPAUSE"
                 , "-dBATCH"
                 , "-dSAFER"
                 , "-sDEVICE=tiffscaled8"
                 , "-sOutputFile={{outfile}}"
                 , "{{infile}}"
                 ]
          timeout = "5 minutes"
        }
        working-dir = ${java.io.tmpdir}"/docspell-extraction"
      }
      # The unpaper command.
      unpaper {
        command {
          program = "unpaper"
          args = [ "{{infile}}", "{{outfile}}" ]
          timeout = "5 minutes"
        }
      }
      # The tesseract command.
      tesseract {
        command {
          program = "tesseract"
          args = ["{{file}}"
                 , "stdout"
                 , "-l"
                 , "{{lang}}"
                 ]
          timeout = "5 minutes"
        }
      }
    }
  }
  # Settings for text analysis
  text-analysis {
    # Maximum length of text to be analysed.
    #
    # All text to analyse must fit into RAM. A large document may take
    # too much heap. Also, most important information is at the
    # beginning of a document, so in most cases the first two pages
    # should suffice. Default is 5000, which are about 2 pages (just a
    # rough guess, of course). For my data, more than 80% of the
    # documents are less than 5000 characters.
    #
    # This value applies to nlp and the classifier. If this value is
    # <= 0, the limit is disabled.
    max-length = 5000
    # A working directory for the analyser to store temporary/working
    # files.
    working-dir = ${java.io.tmpdir}"/docspell-analysis"
    nlp {
      # The mode for configuring NLP models:
      #
      # 1. full – builds the complete pipeline
      # 2. basic - builds only the ner annotator
      # 3. regexonly - matches each entry in your address book via regexps
      # 4. disabled - doesn't use any stanford-nlp feature
      #
      # The full and basic variants rely on pre-built language models
      # that are available for only a few languages. Memory usage
      # varies among the languages. So joex should run with -Xmx1400M
      # at least when using mode=full.
      #
      # The basic variant does a quite good job for German and
      # English. It might be worse for French, always depending on the
      # type of text that is analysed. Joex should run with about 500M
      # heap; here again the German language uses the most.
      #
      # The regexonly variant doesn't depend on a language. It roughly
      # works by converting all entries in your addressbook into
      # regexps and matches each one against the text. This can get
      # memory intensive, too, when the addressbook grows large. This
      # is included in the full and basic by default, but can be used
      # independently by setting mode=regexonly.
      #
      # When mode=disabled, then the whole nlp pipeline is disabled,
      # and you won't get any suggestions. Only what the classifier
      # returns (if enabled).
      mode = full
      # The StanfordCoreNLP library caches language models which
      # requires quite some amount of memory. Setting this interval to a
      # positive duration, the cache is cleared after this amount of
      # idle time. Set it to 0 to disable it if you have enough memory,
      # processing will be faster.
      #
      # This has only any effect, if mode != disabled.
      clear-interval = "15 minutes"
      # Restricts proposals for due dates. Only dates earlier than this
      # number of years in the future are considered.
      max-due-date-years = 10
      regex-ner {
        # Whether to enable custom NER annotation. This uses the
        # address book of a collective as input for NER tagging (to
        # automatically find correspondent and concerned entities). If
        # the address book is large, this can be quite memory
        # intensive and also makes text analysis much slower. But it
        # improves accuracy and can be used independent of the
        # language. If this is set to 0, it is effectively disabled
        # and NER tagging uses only statistical models (that also work
        # quite well, but are restricted to the languages mentioned
        # above).
        #
        # Note, this is only relevant if nlp-config.mode is not
        # "disabled".
        max-entries = 1000
        # The NER annotation uses a file of patterns that is derived
        # from a collective's address book. This is how long this
        # data is kept before a check for a state change is done.
        file-cache-time = "1 minute"
      }
    }
    # Settings for doing document classification.
    #
    # This works by learning from existing documents. This requires a
    # statistical model that is computed from all existing documents.
    # This process is run periodically as configured by the
    # collective. It may require more memory, depending on the amount
    # of data.
    #
    # It utilises this NLP library: https://nlp.stanford.edu/.
    classification {
      # Whether to enable classification globally. Each collective can
      # enable/disable auto-tagging. The classifier is also used for
      # finding correspondents and concerned entities, if enabled
      # here.
      enabled = true
      # If concerned with memory consumption, this restricts the
      # number of items to consider. More are better for training. A
      # negative value or zero means to train on all items.
      #
      # This limit and `text-analysis.max-length` define how much
      # memory is required. On weaker hardware, it is advised to play
      # with these values.
      item-count = 600
      # These settings are used to configure the classifier. If
      # multiple are given, they are all tried and the "best" is
      # chosen at the end. See
      # https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/classify/ColumnDataClassifier.html
      # for more info about these settings. The settings here yielded
      # good results with *my* dataset.
      #
      # Enclose regexps in triple quotes.
      classifiers = [
        { "useSplitWords" = "true"
          "splitWordsTokenizerRegexp" = """[\p{L}][\p{L}0-9]*|(?:\$ ?)?[0-9]+(?:\.[0-9]{2})?%?|\s+|."""
          "splitWordsIgnoreRegexp" = """\s+"""
          "useSplitPrefixSuffixNGrams" = "true"
          "maxNGramLeng" = "4"
          "minNGramLeng" = "1"
          "splitWordShape" = "chris4"
          "intern" = "true" # makes it slower but saves memory
        }
      ]
    }
  }
  # Configuration for converting files into PDFs.
  #
  # Most of it is delegated to external tools, which can be configured
  # below. They must be on the PATH, or their full path must be
  # specified below via the `program` key.
  convert {
    # The chunk size used when storing files. This should be the same
    # as used with the rest server.
    chunk-size = 524288
    # A string used to change the filename of the converted pdf file.
    # If empty, the original file name is used for the pdf file (the
    # extension is always replaced with `pdf`).
    converted-filename-part = "converted"
    # When reading images, this is the maximum size. Images that are
    # larger are not processed.
    max-image-size = ${docspell.joex.extraction.ocr.max-image-size}
    # Settings for converting markdown files (and other text files)
    # to HTML.
    #
    # In order to support text formats, text files are first converted
    # to HTML using a markdown processor. The resulting HTML is then
    # converted to a PDF file.
    markdown {
      # The CSS that is used to style the resulting HTML.
      internal-css = """
        body { padding: 2em 5em; }
      """
    }
    # To convert HTML files into PDF files, the external tool
    # wkhtmltopdf is used.
    wkhtmlpdf {
      command = {
        program = "wkhtmltopdf"
        args = [
          "-s",
          "A4",
          "--encoding",
          "{{encoding}}",
          "--load-error-handling", "ignore",
          "--load-media-error-handling", "ignore",
          "-",
          "{{outfile}}"
        ]
        timeout = "2 minutes"
      }
      working-dir = ${java.io.tmpdir}"/docspell-convert"
    }
    # To convert image files to PDF files, tesseract is used. This
    # also extracts the text in one go.
    tesseract = {
      command = {
        program = "tesseract"
        args = [
          "{{infile}}",
          "out",
          "-l",
          "{{lang}}",
          "pdf",
          "txt"
        ]
        timeout = "5 minutes"
      }
      working-dir = ${java.io.tmpdir}"/docspell-convert"
    }
    # To convert "office" files to PDF files, the external tool
    # unoconv is used. Unoconv uses libreoffice/openoffice for
    # converting. So it supports all formats that are possible to read
    # with libreoffice/openoffice.
    #
    # Note: to greatly improve performance, it is recommended to start
    # a libreoffice listener by running `unoconv -l` in a separate
    # process.
    unoconv = {
      command = {
        program = "unoconv"
        args = [
          "-f",
          "pdf",
          "-o",
          "{{outfile}}",
          "{{infile}}"
        ]
        timeout = "2 minutes"
      }
      working-dir = ${java.io.tmpdir}"/docspell-convert"
    }
    # The tool ocrmypdf can be used to convert pdf files to pdf files
    # in order to add extracted text as a separate layer. This makes
    # image-only pdfs searchable and you can select and copy/paste the
    # text. It also converts pdfs into pdf/a type pdfs, which are best
    # suited for archiving. So it makes sense to use this even for
    # text-only pdfs.
    #
    # It is recommended to install ocrmypdf, but it is optional.
    # If it is enabled but fails, the error is not fatal and the
    # processing will continue using the original pdf for extracting
    # text. You can also disable it to remove the errors from the
    # processing logs.
    #
    # The `--skip-text` option is necessary to not fail on "text" pdfs
    # (where ocr is not necessary). In this case, the pdf will be
    # converted to PDF/A.
    ocrmypdf = {
      enabled = true
      command = {
        program = "ocrmypdf"
        args = [
          "-l", "{{lang}}",
          "--skip-text",
          "--deskew",
          "-j", "1",
          "{{infile}}",
          "{{outfile}}"
        ]
        timeout = "5 minutes"
      }
      working-dir = ${java.io.tmpdir}"/docspell-convert"
    }
  }
  # The same section is also present in the rest-server config. It is
  # used when submitting files into the job queue for processing.
  #
  # Currently, these settings may affect memory usage of all nodes, so
  # they should be the same on all nodes.
  files {
    # Defines the chunk size (in bytes) used to store the files.
    # This will affect the memory footprint when uploading and
    # downloading files. At most this amount is loaded into RAM for
    # down- and uploading.
    #
    # It also defines the chunk size used for the blobs inside the
    # database.
    chunk-size = 524288
    # The file content types that are considered valid. Docspell
    # will only pass these files to processing. The processing code
    # itself also checks which files are supported and which are
    # not. This affects the uploading part and can be used to
    # restrict file types that should be handed over to processing.
    # By default all files are allowed.
    valid-mime-types = [ ]
  }
  # Configuration of the full-text search engine.
  full-text-search {
    # The full-text search feature can be disabled. It requires an
    # additional index server which needs additional memory and disk
    # space. It can be enabled at any time later.
    #
    # Currently the SOLR search platform is supported.
    enabled = false
    # Configuration for the SOLR backend.
    solr = {
      # The URL to solr
      url = "http://localhost:8983/solr/docspell"
      # Used to tell solr when to commit the data
      commit-within = 1000
      # If true, logs request and response bodies
      log-verbose = false
      # The defType parameter to lucene that defines the parser to
      # use. You might want to try "edismax" or look here:
      # https://lucene.apache.org/solr/guide/8_4/query-syntax-and-parsing.html#query-syntax-and-parsing
      def-type = "lucene"
      # The default combiner for tokens. One of {AND, OR}.
      q-op = "OR"
    }
    # Settings for running the index migration tasks
    migration = {
      # Chunk size to use when indexing data from the database. This
      # many attachments are loaded into memory and pushed to the
      # full-text index.
      index-all-chunk = 10
    }
  }
}
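
Since the config file only needs the values that differ from the defaults, an override file for joex can be much shorter than the listing above. The following is a minimal sketch, not a recommended setup: it assumes the keys sit below docspell.joex as shown above, and the Solr host name is a made-up placeholder.

```
# Hypothetical override file for joex. Only values that differ from
# the defaults are listed; everything else keeps its default.
docspell.joex {
  full-text-search {
    # Turn on full-text search (disabled by default).
    enabled = true
    # Placeholder host, replace with the address of your Solr server.
    solr.url = "http://solr.example.internal:8983/solr/docspell"
  }

  # Skip the ocrmypdf step, for example when the tool is not
  # installed and its errors should not show up in the processing logs.
  convert.ocrmypdf.enabled = false
}
```

The file is passed as the single argument to the joex executable. Settings under docspell.server are not affected by this file; they would need their own namespace in the same file or a separate file.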

Logging

By default, docspell logs to stdout. This works well when it is managed by systemd or another init system. Logging is done by logback. Please refer to its documentation for how to configure logging.

If you have created your own logback config file, it can be passed as an argument to the executable using this syntax:

/path/to/docspell -Dlogback.configurationFile=/path/to/your/logging-config-file

To get started, the default config looks like this:

<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <withJansi>true</withJansi>

    <encoder>
      <pattern>[%thread] %highlight(%-5level) %cyan(%logger{15}) - %msg %n</pattern>
    </encoder>
  </appender>

  <logger name="docspell" level="debug" />
  <root level="INFO">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>

The <root level="INFO"> means that, by default, only log statements with level "INFO" or higher are printed. The <logger name="docspell" level="debug" /> above it overrides this for the logger named "docspell" and its children: for these, statements with level "DEBUG" are printed, too.