Diffstat (limited to 'doc/source/architecture.rst')
-rw-r--r-- | doc/source/architecture.rst | 71 |
1 files changed, 71 insertions, 0 deletions
diff --git a/doc/source/architecture.rst b/doc/source/architecture.rst
new file mode 100644
index 0000000..750319e
--- /dev/null
+++ b/doc/source/architecture.rst
@@ -0,0 +1,71 @@
+
+Architecture
+============
+
+
+The information extraction pipeline passes through three stages of abstraction:
+
+1. File format
+2. Content
+3. Predicate-value pairs
+
+For example, an image can be stored in various file formats (JPEG, TIFF, PNG).
+In turn, a file format can store different kinds of information such as the image data (pixels) and additional metadata (image dimensions, EXIF tags).
+Finally, we translate the information read from the file into predicate-value pairs that can be attached to a file node in BSFS, e.g., ``(bse:filesize, 8150000)``, ``(bse:width, 6000)``, ``(bse:height, 4000)``, ``(bse:iso, 100)``, etc.
+
+The extraction pipeline is thus divided into
+:mod:`Readers <bsie.reader>` that abstract from file formats and content types,
+and :mod:`Extractors <bsie.extractor>` which produce predicate-value pairs from content artifacts.
+
+
+Readers
+-------
+
+:mod:`Readers <bsie.reader>` read the actual file (considering different file formats)
+and isolate specific content artifacts therein.
+The content artifact (in an internal representation)
+is then passed to an Extractor for further processing.
+
+For example, the :class:`Image <bsie.reader.image.Image>` reader aims at reading the content (pixels) of an image file.
+It automatically detects which Python package (e.g., `rawpy`_, `pillow`_)
+to use when faced with the various existing image file formats.
+The image data is then converted into a ``PIL.Image.Image`` instance
+(irrespective of which package was used to read the data),
+and passed on to the extractor.
+
+
+Extractors
+----------
+
+:mod:`Extractors <bsie.extractor>` turn content artifacts into
+predicate-value pairs that can be inserted into a BSFS storage.
+The predicate is defined by each extractor, as prescribed by BSFS' schema handling.
+
+For example, the class :class:`ColorsSpatial <bsie.extractor.image.colors_spatial.ColorsSpatial>`
+determines regionally dominant colors from given pixel data.
+It then produces a feature vector and attaches it to the image file via the appropriate predicate.
+
+
+BSIE lib and apps
+-----------------
+
+The advantage of separating the reading and extraction steps is that multiple extractors
+can consume the same content, avoiding repeated reads of the same data.
+This close interaction between readers and extractors is encapsulated
+within the :class:`Pipeline <bsie.lib.pipeline.Pipeline>` class.
+
+In addition, dealing with various file formats and content artifacts
+potentially pulls in a large number of dependencies.
+To make matters worse, many of those might not be needed in a specific scenario,
+e.g., if a user only works with a limited set of file formats.
+BSIE therefore implements a best-effort approach:
+modules that cannot be imported due to missing dependencies are ignored.
+
+With these two concerns taken care of,
+BSIE offers a few :mod:`end-user applications <bsie.apps>`
+that reduce the complexity of the task to a relatively simple command.
+
+
+
+.. _pillow: https://python-pillow.org/
+.. _rawpy: https://github.com/letmaik/rawpy
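The two design points described in the added document (one read shared by several extractors, and best-effort handling of optional dependencies) can be illustrated with a minimal sketch. It is not taken from the bsie code base: ``StubImageReader``, ``extract_width``, ``extract_height``, ``run_pipeline`` and ``import_available`` are hypothetical names chosen for illustration, and the stub reader returns fixed values instead of parsing a file.

```python
import importlib
from types import ModuleType
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

# All names below are hypothetical stand-ins, not the bsie API.
Pair = Tuple[str, Any]


class StubImageReader:
    """Reader: turns a file path into one content artifact."""

    def __init__(self) -> None:
        self.calls = 0  # counts how often the "file" was read

    def __call__(self, path: str) -> Dict[str, int]:
        self.calls += 1
        # A real reader would open and decode the file; fixed values are used here.
        return {"width": 6000, "height": 4000}


def extract_width(content: Dict[str, int]) -> Iterator[Pair]:
    """Extractor: emits a predicate-value pair for the image width."""
    yield ("bse:width", content["width"])


def extract_height(content: Dict[str, int]) -> Iterator[Pair]:
    """Extractor: emits a predicate-value pair for the image height."""
    yield ("bse:height", content["height"])


def run_pipeline(
    path: str,
    reader: Callable[[str], Any],
    extractors: Iterable[Callable[[Any], Iterable[Pair]]],
) -> List[Pair]:
    """Read the content once and pass it to every extractor."""
    content = reader(path)
    pairs: List[Pair] = []
    for extractor in extractors:
        pairs.extend(extractor(content))
    return pairs


def import_available(module_names: Iterable[str]) -> Dict[str, ModuleType]:
    """Best-effort import: skip modules whose dependencies are missing."""
    loaded: Dict[str, ModuleType] = {}
    for name in module_names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError:
            continue  # ignore modules that cannot be imported
    return loaded
```

Calling ``run_pipeline("photo.jpg", reader, [extract_width, extract_height])`` with a single ``StubImageReader`` instance invokes the reader once, regardless of how many extractors consume the result. ``import_available`` keeps only the modules that import successfully, which mirrors the best-effort idea under the assumption that a missing dependency surfaces as an ``ImportError``.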