author | Matthias Baumgartner <dev@igsor.net> | 2023-03-05 19:22:46 +0100 |
---|---|---|
committer | Matthias Baumgartner <dev@igsor.net> | 2023-03-05 19:22:46 +0100 |
commit | af81318ae9311fd0b0e16949cef3cfaf7996970b (patch) | |
tree | fb220da28bb7248ebf37ce09af5de88f2c1aaad4 /doc/source | |
parent | 7bf6b33fa6d6b901e4933bfe0b2a9939d7b3f3f3 (diff) | |
parent | 8b460aa0232cd841af7b7734c91982bc83486e03 (diff) | |
Merge branch 'mb/diogenes' into develop
Diffstat (limited to 'doc/source')
-rw-r--r-- | doc/source/architecture.rst | 71 |
-rw-r--r-- | doc/source/conf.py | 37 |
-rw-r--r-- | doc/source/index.rst | 26 |
-rw-r--r-- | doc/source/installation.rst | 75 |
4 files changed, 209 insertions, 0 deletions
diff --git a/doc/source/architecture.rst b/doc/source/architecture.rst
new file mode 100644
index 0000000..750319e
--- /dev/null
+++ b/doc/source/architecture.rst
@@ -0,0 +1,71 @@
+
+Architecture
+============
+
+
+The information extraction pipeline traverses through three stages of abstraction:
+
+1. File format
+2. Content
+3. Predicate-value pairs
+
+For example, an image can be stored in various file formats (JPEG, TIFF, PNG).
+In turn, a file format can store different kinds of information such as the image data (pixels) and additional metadata (image dimensions, EXIF tags).
+Finally, we translate the information read from the file into predicate-value pairs that can be attached to a file node in BSFS, e.g., ``(bse:filesize, 8150000)``, ``(bse:width, 6000)``, ``(bse:height, 4000)``, ``(bse:iso, 100)``, etc.
+
+The extraction pipeline is thus divided into
+:mod:`Readers <bsie.reader>` that abstract from file formats and content types,
+and :mod:`Extractors <bsie.extractor>` which produce predicate-value pairs from content artifacts.
+
+
+Readers
+-------
+
+:mod:`Readers <bsie.reader>` read the actual file (considering different file formats)
+and isolate specific content artifacts therein.
+The content artifact (in an internal representation)
+is then passed to an Extractor for further processing.
+
+For example, the :class:`Image <bsie.reader.image.Image>` reader aims at reading the content (pixels) of an image file.
+It automatically detects which python package (e.g., `rawpy`_, `pillow`_)
+to use when faced with the various existing image file formats.
+The image data is then converted into a PIL.Image instance
+(irrespective of which package was used to read the data),
+and passed on to the extractor.
+
+
+Extractors
+----------
+
+:mod:`Extractors <bsie.extractor>` turn content artifacts into
+predicate-value pairs that can be inserted into a BSFS storage.
+The predicate is defined by each extractor, as prescribed by BSFS' schema handling.
+
+For example, the class :class:`ColorsSpatial <bsie.extractor.image.colors_spatial.ColorsSpatial>`
+determines regionally dominant colors from given pixel data.
+It then produces a feature vector and attaches it to the image file via the appropriate predicate.
+
+
+BSIE lib and apps
+-----------------
+
+The advantage of separating the reading and extraction steps is that multiple extractors
+can consume the same content, avoiding multiple re-reads of the same data.
+This close interaction between readers and extractors is encapsulated
+within the :class:`Pipeline <bsie.lib.pipeline.Pipeline>` class.
+
+Also, having to deal with various file formats and content artifacts
+potentially pulls in a large number of dependencies.
+To make matters worse, many of those might not be needed in a specific scenario,
+e.g., if a user only works with a limited set of file formats.
+BSIE therefore implements a best-effort approach,
+that is, modules that cannot be imported due to missing dependencies are ignored.
+
+With these two concerns taken care of,
+BSIE offers a few :mod:`end-user applications <bsie.apps>`
+that reduce the complexity of the task to a relatively simple command.
+
+
+
+.. _pillow: https://python-pillow.org/
+.. _rawpy: https://github.com/letmaik/rawpy
diff --git a/doc/source/conf.py b/doc/source/conf.py
new file mode 100644
index 0000000..017e036
--- /dev/null
+++ b/doc/source/conf.py
@@ -0,0 +1,37 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# For the full list of built-in configuration values, see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Project information -----------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
+
+project = 'Black Star Information Extraction'
+copyright = '2023, Matthias Baumgartner'
+author = 'Matthias Baumgartner'
+release = '0.5'
+
+# -- General configuration ---------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
+
+extensions = [
+    'sphinx_copybutton',
+    'sphinx.ext.autodoc',
+    ]
+
+templates_path = ['_templates']
+exclude_patterns = []
+
+
+
+# -- Options for HTML output -------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
+
+html_theme = 'furo'
+html_static_path = ['_static']
+
+html_title = 'bsie'
+html_theme_options = {
+    'announcement': '<em>This project is under heavy development and subject to rapid changes. Use at your own discretion.</em>',
+    }
+
diff --git a/doc/source/index.rst b/doc/source/index.rst
new file mode 100644
index 0000000..9cf06fe
--- /dev/null
+++ b/doc/source/index.rst
@@ -0,0 +1,26 @@
+
+Black Star Information Extraction
+=================================
+
+A major advantage of the `Black Star File System (BSFS) <https://www.bsfs.io/bsfs/>`_
+is its ability to store various kinds of (meta)data associated with a file.
+However, the BSFS itself is only a storage solution,
+it does not inspect files or collect information about them.
+
+The Black Star Information Extraction (BSIE) package fills this gap by
+extracting various kinds of information from a file and pushing that data to a BSFS instance.
+
+BSIE has the ability to process numerous file formats,
+and it can turn various aspects of a file into usable information.
+This includes metadata from a source file system,
+metadata stored within the file,
+and even excerpts or feature representations of the file's content itself.
+
+.. toctree::
+   :maxdepth: 1
+
+   installation
+   architecture
+   api/modules
+
+
diff --git a/doc/source/installation.rst b/doc/source/installation.rst
new file mode 100644
index 0000000..ee6fadb
--- /dev/null
+++ b/doc/source/installation.rst
@@ -0,0 +1,75 @@
+
+Installation
+============
+
+You can install *bsie* via pip. BSIE comes with support for various file formats.
+For this, it needs to install many external packages. BSIE lets you control
+which of these you want to install. Note that if you choose to not install
+support for some file types, BSIE will show a warning and skip them.
+All other formats will be processed normally.
+It is recommended to install *bsie* in a virtual environment (via ``virtualenv``).
+
+To install only the minimally required software, use::
+
+    pip install --extra-index-url https://pip.bsfs.io bsie
+
+To install all dependencies, use the following shortcut::
+
+    pip install --extra-index-url https://pip.bsfs.io bsie[all]
+
+To install a subset of all dependencies, modify the extras part (``[image, preview]``)
+of the following command to your liking::
+
+    pip install --extra-index-url https://pip.bsfs.io bsie[image,preview]
+
+Currently, BSIE provides the following extra flags:
+
+* image: Read data from image files.
+  Note that you may also have to install ``exiftool`` through your system's
+  package manager (e.g. ``sudo apt install exiftool``).
+* preview: Create previews from a variety of files.
+  Note that support for various file formats also depends on what
+  system packages you've installed. You should at least install ``imagemagick``
+  through your system's package manager (e.g. ``sudo apt install imagemagick``).
+  See `Preview Generator <https://github.com/algoo/preview-generator>`_ for
+  more detailed instructions.
+* features: Extract feature vectors from images.
+
+
+
+License
+-------
+
+This project is released under the terms of the 3-clause BSD License.
+By downloading or using the application you agree to the license's terms and conditions.
+
+.. literalinclude:: ../../LICENSE
+
+
+Source
+------
+
+Check out our git repository::
+
+    git clone https://git.bsfs.io/bsie.git
+
+You can further install *bsie* via the usual `setuptools <https://setuptools.pypa.io/en/latest/index.html>`_ commands from your bsie source directory::
+
+    python setup.py develop
+
+For development, you also need to install some additional dependencies::
+
+    # extra packages for tests
+    pip install rdflib requests
+
+    # code style discipline
+    pip install mypy coverage pylint
+    # external type annotations for pyyaml
+    pip install types-PyYAML
+
+    # documentation
+    pip install sphinx sphinx-copybutton furo
+
+    # packaging
+    pip install build
+
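
The reader/extractor split described in architecture.rst can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `read_text`, `extract_line_count`, `extract_word_count`, `run_pipeline`, `demo`, and the `ex:` predicates are hypothetical names and are not taken from the bsie code base, whose actual reader, extractor, and pipeline classes are not part of this diff. The sketch shows the one property the documentation stresses: the file is read once and the same content artifact is handed to several extractors.

```python
import os
import tempfile
from typing import Any, Callable, Iterable, List, Tuple

# All names below are hypothetical; they only illustrate the reader/extractor
# split and do not mirror the real bsie API.
Pair = Tuple[str, Any]  # (predicate, value)


def read_text(path: str) -> str:
    """Reader: hide the file format and return a content artifact (the text)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def extract_line_count(content: str) -> Iterable[Pair]:
    """Extractor: turn the content artifact into predicate-value pairs."""
    yield ("ex:lines", len(content.splitlines()))


def extract_word_count(content: str) -> Iterable[Pair]:
    """A second extractor that consumes the same content artifact."""
    yield ("ex:words", len(content.split()))


def run_pipeline(
    path: str,
    reader: Callable[[str], str],
    extractors: List[Callable[[str], Iterable[Pair]]],
) -> List[Pair]:
    """Read the file once, then pass the content to every extractor."""
    content = reader(path)
    pairs: List[Pair] = []
    for extractor in extractors:
        pairs.extend(extractor(content))
    return pairs


def demo() -> List[Pair]:
    """Write a small temporary file and run both extractors on it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("hello world\nsecond line\n")
        return run_pipeline(path, read_text, [extract_line_count, extract_word_count])
```

Calling `demo()` yields one pair per extractor. Adding a further extractor to the list does not trigger another read of the file, which is the benefit the architecture text attributes to keeping readers and extractors separate.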