From e0c4713c40367b4b41da926da0ba7ed05d47d54b Mon Sep 17 00:00:00 2001
From: Matthias Baumgartner
Date: Wed, 1 Mar 2023 22:05:06 +0100
Subject: documentation

---
 doc/source/architecture.rst | 71 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100644 doc/source/architecture.rst

diff --git a/doc/source/architecture.rst b/doc/source/architecture.rst
new file mode 100644
index 0000000..750319e
--- /dev/null
+++ b/doc/source/architecture.rst
@@ -0,0 +1,71 @@
+
+Architecture
+============
+
+
+The information extraction pipeline passes through three stages of abstraction:
+
+1. File format
+2. Content
+3. Predicate-value pairs
+
+For example, an image can be stored in various file formats (JPEG, TIFF, PNG).
+In turn, a file format can store different kinds of information such as the image data (pixels) and additional metadata (image dimensions, EXIF tags).
+Finally, we translate the information read from the file into predicate-value pairs that can be attached to a file node in BSFS, e.g., ``(bse:filesize, 8150000)``, ``(bse:width, 6000)``, ``(bse:height, 4000)``, ``(bse:iso, 100)``, etc.
+
+The extraction pipeline is thus divided into
+:mod:`Readers ` that abstract from file formats and content types,
+and :mod:`Extractors ` which produce predicate-value pairs from content artifacts.
+
+
+Readers
+-------
+
+:mod:`Readers ` read the actual file (considering different file formats)
+and isolate specific content artifacts therein.
+The content artifact (in an internal representation)
+is then passed to an Extractor for further processing.
+
+For example, the :class:`Image ` reader aims at reading the content (pixels) of an image file.
+It automatically detects which Python package (e.g., `rawpy`_, `pillow`_)
+to use when faced with the various existing image file formats.
+The image data is then converted into a ``PIL.Image.Image`` instance
+(irrespective of which package was used to read the data),
+and passed on to the extractor.
+
+
+Extractors
+----------
+
+:mod:`Extractors ` turn content artifacts into
+predicate-value pairs that can be inserted into a BSFS storage.
+The predicate is defined by each extractor, as prescribed by BSFS' schema handling.
+
+For example, the class :class:`ColorsSpatial ` class.
+
+Also, having to deal with various file formats and content artifacts
+potentially pulls in a large number of dependencies.
+To make matters worse, many of those might not be needed in a specific scenario,
+e.g., if a user only works with a limited set of file formats.
+BSIE therefore implements a best-effort approach,
+that is, modules that cannot be imported due to missing dependencies are ignored.
+
+With these two concerns taken care of,
+BSIE offers a few :mod:`end-user applications `
+that reduce the complexity of the task to a relatively simple command.
+
+
+
+.. _pillow: https://python-pillow.org/
+.. _rawpy: https://github.com/letmaik/rawpy
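The reader/extractor split and the best-effort import described in the patch can be illustrated with a minimal sketch. It is not BSIE's actual API: the names ``StatReader``, ``FilesizeExtractor``, ``run_pipeline`` and ``import_available`` are hypothetical and chosen only for illustration. The sketch uses the standard library alone, and only the ``bse:filesize`` predicate is taken from the documentation above.

```python
import importlib
import os
from typing import Any, Dict, Iterable, List, Tuple


def import_available(module_names: Iterable[str]) -> Dict[str, Any]:
    """Best-effort import (hypothetical helper): load what is installed, skip the rest."""
    loaded = {}
    for name in module_names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError:
            # A missing optional dependency is not an error; the module is ignored.
            continue
    return loaded


class StatReader:
    """Hypothetical reader: isolates file system metadata as the content artifact."""

    def __call__(self, path: str) -> os.stat_result:
        return os.stat(path)


class FilesizeExtractor:
    """Hypothetical extractor: turns the content artifact into predicate-value pairs."""

    def extract(self, content: os.stat_result) -> Iterable[Tuple[str, Any]]:
        yield ("bse:filesize", content.st_size)


def run_pipeline(path: str, reader, extractors) -> List[Tuple[str, Any]]:
    """Read the file once, then pass the same artifact to every extractor."""
    content = reader(path)
    pairs = []
    for extractor in extractors:
        pairs.extend(extractor.extract(content))
    return pairs
```

In this sketch, the reader is the only component that touches the file, so an extractor could be reused with any reader that produces the same kind of content artifact. The import helper mirrors the best-effort idea: a caller would register only the readers and extractors whose modules were actually loaded.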