Everything, including all files, is stored in the database.
That seems to put off some people coming to Docspell, so here are some thoughts on why this is so and why it may not be such a big deal. It was a conscious decision: the option to hold all files in the file system was considered, but not chosen in the end.
First, it was clear that a database is required in order to support the planned features. It is needed to efficiently support a multi-user application: the account data, passwords and many other things (tags, metadata, etc.) must be stored and queried reliably. Very often a relational model emerges and a database is the best fit; otherwise one would just "reinvent the wheel". So the options are to have a database plus files in the file system, or everything in one database. There are, of course, pros and cons for both ways; these were the reasons for the current decision:
Of course, there is no guarantee that this project will be alive in the future, so it is important to know how you can still use your data then.
A very important thing is: it is FREE software (as in freedom and as in beer). That is, you can be sure to use the current version for as long as you want. So it is a good idea to back up the releases (or docker images) you are using alongside your data. This ensures that you will be able to use your data "forever". This also means that you can inspect the data model and use the API and/or standard SQL tools to get all the data. While this may be difficult or inconvenient, the point here is only that it is possible. It's not hidden or obscured, nothing is lost. You can even back up the sources to keep this documentation, too.
In order to move to a different tool, it is necessary to get the data out of Docspell in a machine-readable, automated way. Currently, there is an export-files.sh script provided (in the tools/ folder) that can be used to download all your files and item metadata.
My recommendation is to run periodic database backups and also store the binaries/docker images. This lets you re-create the current state at any time, which in turn lets you postpone the problem of getting the data out of Docspell in a specific format.
Note that you don't need to back up the Solr instance (if you're using fulltext search), since its index can be recreated by Docspell.
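As a rough illustration of the periodic database backup, here is a minimal sketch that only builds a timestamped `pg_dump` command line. It assumes a PostgreSQL backend and a database called `docspell`; both are assumptions for this example, and the function name `pg_dump_command` is made up here and is not part of Docspell. Other databases need their own dump tool.

```python
from datetime import datetime, timezone
from pathlib import Path


def pg_dump_command(db_name: str, backup_dir: Path, now: datetime) -> list[str]:
    """Build a pg_dump invocation that writes a timestamped dump file.

    Hypothetical helper: assumes PostgreSQL; the database name and the
    backup directory are whatever your own setup uses.
    """
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = backup_dir / f"{db_name}-{stamp}.dump"
    # --format=custom produces an archive that pg_restore can read.
    return ["pg_dump", "--format=custom", f"--file={target}", db_name]


if __name__ == "__main__":
    cmd = pg_dump_command("docspell", Path("backups"), datetime.now(timezone.utc))
    print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from a cron job or a systemd timer; keeping the release binaries or docker images next to the dump files covers the second half of the recommendation.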
Documents are normally not OCR'ed twice. Docspell first extracts the text from a PDF. If this text is shorter than some configurable minimum length, it will still run OCR just to see if that gives more, and then the longer of the two texts is taken. By default it will hand all PDFs to ocrmypdf, but ocrmypdf will skip files that are already OCR'ed. The whole ocrmypdf process can be switched off in the config file, so if you only have already OCR'ed PDFs, this would be an option, I guess. Alternatively, it is possible to change the ocrmypdf options in Docspell's config file to fit your requirements.
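The text selection described above can be sketched roughly as follows. This is only an illustration of the rule "run OCR when the extracted text is too short, then keep the longer text", not Docspell's actual code; the names `choose_text`, `run_ocr` and `min_length` are made up for this example.

```python
from typing import Callable


def choose_text(extracted: str, run_ocr: Callable[[], str], min_length: int) -> str:
    """Pick the text to keep for a PDF (hypothetical sketch).

    extracted:  text pulled directly out of the PDF
    run_ocr:    callable that performs OCR and returns its text;
                it is only called when it is actually needed
    min_length: stand-in for the configurable minimum length
    """
    if len(extracted) >= min_length:
        # Enough embedded text, so OCR is skipped entirely.
        return extracted
    ocr_text = run_ocr()
    # Keep whichever text is longer; ties go to the extracted text.
    return ocr_text if len(ocr_text) > len(extracted) else extracted
```

Passing OCR as a callable keeps the expensive step lazy: a PDF that already carries enough text never triggers it.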
Back when Docspell started, there weren't as many options as there are now. I wanted to try out a different approach. You can read more about that here.