Files are processed by the joex component, so all related configuration is found in the joex config only.
File processing involves several stages; detailed information can be found here and in the corresponding sections of the joex default config.
The configuration lets you define the external tools and set some limits to control memory usage. The sections are:
Options to external commands can use variables that are replaced by
values at runtime. Variables are enclosed in double braces.
Please see the default configuration for which variables exist per
command.
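The substitution of double-brace variables can be illustrated with a minimal Python sketch. This is not how joex implements it; the function `expand_args` and the variable names `infile` and `lang` are hypothetical and chosen only for illustration. The default configuration remains the reference for the variables that actually exist.

```python
import re

# Matches a variable wrapped in double braces, e.g. {{infile}}.
_VARIABLE = re.compile(r"\{\{([A-Za-z0-9_.-]+)\}\}")


def expand_args(args, values):
    """Replace every {{name}} placeholder in each argument with its runtime value.

    Names without a value are left untouched, so a typo stays visible.
    """
    def substitute(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))

    return [_VARIABLE.sub(substitute, arg) for arg in args]


# Hypothetical command options and variable names, for illustration only.
args = ["--input", "{{infile}}", "--lang={{lang}}", "{{unknown}}"]
print(expand_args(args, {"infile": "/tmp/doc.pdf", "lang": "deu"}))
# ['--input', '/tmp/doc.pdf', '--lang=deu', '{{unknown}}']
```

Leaving unknown placeholders in place is a design choice of this sketch, not a statement about how joex treats them.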
In text-analysis.classification you can define how many documents at
most should be used for learning. The default settings should work
well for most cases. However, it always depends on the amount of data
and the machine that runs joex. For example, by default the documents
to learn from are limited to 600 and every text is cut after 5000
characters. This is fine if most of your documents are small and only
a few are near 5000 characters. But if all your documents are very
large, you probably need to either assign more heap memory or lower
these limits.
Classification can also be disabled when it is not needed.
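To get a feeling for what the two default limits mean together, here is a small Python sketch. It only models the arithmetic; the function and parameter names (`item_count`, `max_length`) are illustrative assumptions, and taking the first documents of a list is an assumption as well, since the text does not say how joex selects them.

```python
def select_training_texts(texts, item_count=600, max_length=5000):
    """Keep at most item_count texts and cut each one after max_length characters."""
    return [text[:max_length] for text in texts[:item_count]]


def worst_case_characters(item_count=600, max_length=5000):
    """Upper bound on characters held when every text reaches the cut-off."""
    return item_count * max_length


docs = ["short invoice", "x" * 12000]
limited = select_training_texts(docs)
print([len(text) for text in limited])  # [13, 5000]
print(worst_case_characters())          # 3000000
```

With the defaults, the worst case is 600 texts of 5000 characters each, which is 3,000,000 characters of raw text. That worst case only occurs when all documents are large, which is why reducing either limit lowers the memory needed for learning.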
This setting defines which NLP mode to use. It defaults to full,
which requires more memory for certain languages (with the advantage
of better results). Other values are basic, regexonly and
disabled.
The modes full and basic use pre-defined language
models for processing documents in German, English, French and
Spanish. These require some amount of memory (see below).
basic is the "light" variant of full. It doesn't use
all NLP features, which makes memory consumption much lower, at the
cost of less accurate results.
regexonly doesn't use pre-defined language models, even if
available. It checks your address book against a document to find
metadata, which makes it language independent. Also, when using
full or basic with languages where no pre-defined models exist, it
will degrade to regexonly for these.
disabled skips NLP processing completely. This has the least
impact on memory consumption, but then only the classifier
is used to find metadata (unless it is disabled, too).
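The fallback described for languages without pre-defined models can be summarized in a short Python sketch. It is a simplified model of the behaviour described here, not joex code; the function `effective_mode` and the use of plain language names as keys are assumptions made for illustration.

```python
# Languages for which the text names pre-defined models.
LANGUAGES_WITH_MODELS = {"German", "English", "French", "Spanish"}

# The four NLP modes described in the text.
MODES = {"full", "basic", "regexonly", "disabled"}


def effective_mode(mode, language):
    """Return the mode that is effectively used for a document language.

    Modes that rely on language models degrade to regexonly when no
    pre-defined model exists for the language.
    """
    if mode not in MODES:
        raise ValueError(f"unknown NLP mode: {mode}")
    if mode in ("full", "basic") and language not in LANGUAGES_WITH_MODELS:
        return "regexonly"
    return mode


print(effective_mode("full", "German"))       # full
print(effective_mode("basic", "Italian"))     # regexonly
print(effective_mode("disabled", "Italian"))  # disabled
```

Italian is used only as an example of a language that is not in the list of four.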
You might want to try different modes and see which combination best suits your usage pattern and the machine running joex. If a powerful machine is used, simply leave the defaults. When running on a Raspberry Pi, for example, you might need to adjust things.
The memory requirements for the joex component depend on the document
language and the enabled features for text-analysis. The NLP mode
setting has a significant impact, especially when your documents are in
German. Here are some rough numbers on JVM heap usage (the same file
was used for all runs):
Note that these are only rough numbers and they show the maximum used heap memory while processing a file.
mode=full, a heap setting of at least
mode=basic a heap setting of at least
Other languages can't use these two modes, and so don't require this
amount of memory (but the results are not as good). Then you can go
with less heap. For these languages, the NLP mode is the same as
regexonly.
Training the classifier is also memory intensive; how much memory it needs depends solely on the size and number of documents used for training. However, the classifier is trained only periodically, perhaps every two weeks. When classifying new documents, memory requirements are lower, since the model already exists.
More details about these modes can be found here.
The restserver component is very lightweight; here you can use the defaults.