Diffstat (limited to 'doc/source/architecture.rst')
-rw-r--r-- doc/source/architecture.rst | 71
1 files changed, 71 insertions, 0 deletions
diff --git a/doc/source/architecture.rst b/doc/source/architecture.rst
new file mode 100644
index 0000000..750319e
--- /dev/null
+++ b/doc/source/architecture.rst
@@ -0,0 +1,71 @@
+
+Architecture
+============
+
+
+The information extraction pipeline passes through three levels of abstraction:
+
+1. File format
+2. Content
+3. Predicate-value pairs
+
+For example, an image can be stored in various file formats (JPEG, TIFF, PNG).
+In turn, a file format can store different kinds of information such as the image data (pixels) and additional metadata (image dimensions, EXIF tags).
+Finally, we translate the information read from the file into predicate-value pairs that can be attached to a file node in BSFS, e.g., ``(bse:filesize, 8150000)``, ``(bse:width, 6000)``, ``(bse:height, 4000)``, ``(bse:iso, 100)``, etc.
+
+The extraction pipeline is thus divided into
+:mod:`Readers <bsie.reader>` that abstract from file formats and content types,
+and :mod:`Extractors <bsie.extractor>` that produce predicate-value pairs from content artifacts.
+
+
+Readers
+-------
+
+:mod:`Readers <bsie.reader>` read the actual file (considering different file formats)
+and isolate specific content artifacts therein.
+The content artifact (in an internal representation)
+is then passed to an Extractor for further processing.
+
+For example, the :class:`Image <bsie.reader.image.Image>` reader aims at reading the content (pixels) of an image file.
+It automatically detects which Python package (e.g., `rawpy`_, `pillow`_)
+to use for the image file format at hand.
+The image data is then converted into a ``PIL.Image.Image`` instance
+(irrespective of which package was used to read the data),
+and passed on to the extractor.
+
+
+Extractors
+----------
+
+:mod:`Extractors <bsie.extractor>` turn content artifacts into
+predicate-value pairs that can be inserted into a BSFS storage.
+The predicate is defined by each extractor, as prescribed by BSFS' schema handling.
+
+For example, the class :class:`ColorsSpatial <bsie.extractor.image.colors_spatial.ColorsSpatial>`
+determines regionally dominant colors from given pixel data.
+It then produces a feature vector and attaches it to the image file via the appropriate predicate.
+
+
+BSIE lib and apps
+-----------------
+
+The advantage of separating the reading and extraction steps is that multiple extractors
+can consume the same content, so the same data need not be read repeatedly.
+This close interaction between readers and extractors is encapsulated
+within the :class:`Pipeline <bsie.lib.pipeline.Pipeline>` class.
+
+In addition, supporting various file formats and content artifacts
+potentially pulls in a large number of dependencies.
+To make matters worse, many of those might not be needed in a specific scenario,
+e.g., if a user only works with a limited set of file formats.
+BSIE therefore takes a best-effort approach:
+modules that cannot be imported due to missing dependencies are ignored.
+
+With these two concerns taken care of,
+BSIE offers a few :mod:`end-user applications <bsie.apps>`
+that reduce the complexity of the task to a relatively simple command.
+
+
+
+.. _pillow: https://python-pillow.org/
+.. _rawpy: https://github.com/letmaik/rawpy