Metadata-Version: 2.4
Name: fmrib-unpack
Version: 4.0.1
Summary: The FMRIB UKBiobank Normalisation, Parsing And Cleaning Kit
Author-email: Paul McCarthy <paul.mccarthy@ndcn.ox.ac.uk>, Thomas Nichols <tom.nichols@bdi.ox.ac.uk>, Jessie Liu <Zhangdaihong.Liu@warwick.ac.uk>, Steve Smith <steve@fmrib.ox.ac.uk>
License: Apache License Version 2.0
Project-URL: Repository, https://git.fmrib.ox.ac.uk/fsl/funpack.git
Project-URL: Documentation, https://open.win.ox.ac.uk/pages/fsl/funpack/
Requires-Python: >=3.8
Description-Content-Type: text/x-rst
License-File: LICENSE
Requires-Dist: fmrib-unpack-fmrib-config
Requires-Dist: h5py
Requires-Dist: numpy
Requires-Dist: pandas>=0.24
Requires-Dist: pyparsing
Requires-Dist: tables
Requires-Dist: threadpoolctl
Requires-Dist: typing-extensions
Provides-Extra: demo
Requires-Dist: bash_kernel; extra == "demo"
Requires-Dist: jupyter; extra == "demo"
Requires-Dist: nbclassic; extra == "demo"
Requires-Dist: notebook; extra == "demo"
Requires-Dist: pygments; extra == "demo"
Provides-Extra: test
Requires-Dist: coverage; extra == "test"
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Requires-Dist: jinja2; extra == "test"
Requires-Dist: jupyter_server; extra == "test"
Requires-Dist: nbconvert; extra == "test"
Requires-Dist: sphinx; extra == "test"
Requires-Dist: sphinx_rtd_theme; extra == "test"
Requires-Dist: sphinx-rtd-dark-mode; extra == "test"
Requires-Dist: nbsphinx; extra == "test"
Requires-Dist: pylint; extra == "test"
Requires-Dist: flake8; extra == "test"
Dynamic: license-file

**FUNPACK** - the FMRIB UK Biobank Normalisation, Parsing And Cleaning Kit
==========================================================================


.. image:: https://img.shields.io/pypi/v/fmrib-unpack.svg
   :target: https://pypi.python.org/pypi/fmrib-unpack/


.. image:: https://anaconda.org/conda-forge/fmrib-unpack/badges/version.svg
   :target: https://anaconda.org/conda-forge/fmrib-unpack


.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.1997626.svg
   :target: https://doi.org/10.5281/zenodo.1997626


.. image:: https://git.fmrib.ox.ac.uk/fsl/funpack/badges/main/coverage.svg
   :target: https://git.fmrib.ox.ac.uk/fsl/funpack/


**FUNPACK** is a Python application for pre-processing of UK Biobank phenotype
data.

    FUNPACK is developed at the Oxford Centre for Integrative Neuroimaging
    (OxCIN@FMRIB), University of Oxford. FUNPACK is in no way endorsed,
    sanctioned, or validated by the `UK Biobank
    <https://www.ukbiobank.ac.uk/>`_.


Installation
------------


Install FUNPACK from ``conda-forge``::

    conda install -c conda-forge fmrib-unpack


Or using ``pip``::

    pip install fmrib-unpack



The FUNPACK source code can be found at
https://git.fmrib.ox.ac.uk/fsl/funpack/.


Introductory notebook
---------------------


The ``fmrib_unpack_demo`` command will start a Jupyter Notebook which
introduces the main features provided by FUNPACK. A non-interactive version of
this notebook can be found `here <demo.ipynb>`_.

If you are using ``pip``, you need to install a few additional dependencies::

    pip install fmrib-unpack[demo]


You can then start the demo by running ``fmrib_unpack_demo``.


.. note:: The introductory notebook uses ``bash``, so is unlikely to work on
          Windows (unless you are using the Windows Subsystem for Linux).


Usage
-----


General usage is as follows::

    fmrib_unpack [options] output.tsv input1.tsv input2.tsv


You can get information on all of the options by typing ``fmrib_unpack --help``.

    The ``fmrib_unpack`` command was called ``funpack`` in older versions of
    FUNPACK, but was changed to ``fmrib_unpack`` in 3.0.0 to avoid a naming
    conflict with an `unrelated software package
    <https://heasarc.gsfc.nasa.gov/fitsio/>`_.


Options can be specified on the command line, and/or stored in a configuration
file. For example, the options in the following command line::

    fmrib_unpack                     \
      --overwrite                    \
      --write_log                    \
      --icd10_map_file icd_codes.tsv \
      --category 10                  \
      --category 11                  \
      output.tsv input1.tsv input2.tsv


Could be stored in a configuration file ``config.txt``::

    overwrite
    write_log
    icd10_map_file icd_codes.tsv
    category       10
    category       11


And then executed as follows::

    fmrib_unpack -cfg config.txt output.tsv input1.tsv input2.tsv


Features
--------


FUNPACK allows you to perform various data sanitisation and processing steps
on your data, such as:

 * **NA value replacement**: Specific values for some data-fields can be
   replaced with NA, for example, data-fields where a value of -1 indicates
   *Do not know*.

 * **Categorical recoding**: Certain categorical data-fields can re-coded. For
   example, data-fields where a value of 555 represents *half* can be recoded
   so that 555 is replaced with 0.5.

 * **Child value replacement**: NA values within some data-fields which are
   dependent upon other data-fields may have values inserted based on the
   values of their parent data-fields.


See the `overview <demo.ipynb>`_ for a more comprehensive overview of the
features available in FUNPACK.


Built-in rules
--------------


FUNPACK contains a large number of built-in rules which have been specifically
written to pre-process UK Biobank data-fields (also referred to as
*variables*). These rules are stored in the following files [*]_:

 * ``funpack/configs/fmrib/datacodings_*.tsv``: Cleaning rules for
   data-codings
 * ``funpack/configs/fmrib/variables_*.tsv``: Cleaning rules for individual
   data-fields
 * ``funpack/configs/fmrib/processing.tsv``: Processing steps
 * ``funpack/configs/fmrib/categories.tsv``: Data-field categories

.. [*] The ``funpack/`` directory is located in
        ``<python-env>/lib/python<X.Y>/site-packages/funpack/``, where
        ``<python-env>`` is the location of your Python environment directory,
        and ``<X.Y>`` is the Python version you are using.


You can use these rules by using the FMRIB configuration profile::

    fmrib_unpack -cfg fmrib output.tsv input.tsv


You can customise or replace these files as you see fit. You can also pass
your own versions of these files to FUNPACK via the ``--variable_file``,
``--datacoding_file``, ``--type_file``, ``--processing_file``, and
``--category_file`` command-line options respectively. FUNPACK will load all
data-field and data-coding files, and merge them into a single table which
contains the cleaning rules for each data-field.


FUNPACK also comes bundled with a copy of the UK Bioobank schema, containing
metadata about all UK Biobank data-fields.  The schema can be obtained from
the `UK Biobank online data showcase
<https://biobank.ctsu.ox.ac.uk/showcase/index.cgi>`_


.. note:: The ``fmrib`` configuration profile is managed and released
          separately from FUNPACK `here
          <https://git.fmrib.ox.ac.uk/fsl/funpack-fmrib-config/>`_. However,
          it is automatically installed alongside FUNPACK, so if you have
          FUNPACK, you can use the ``fmrib`` profile. If you are using FUNPACK
          from a source checkout, you may need to manually install the
          ``fmrib-unpack-fmrib-config`` package from `PyPi
          <https://pypi.org/project/fmrib-unpack-fmrib-config/>`_ or
          `conda-forge
          <https://anaconda.org/conda-forge/fmrib-unpack-fmrib-config>`_.


Creating your own rule files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^


To define rules at the *data-coding* level, create one or more ``.tsv`` files
with an ``ID`` column containing the data-coding ID, and any of the following
columns:


  - ``NAValues``: A comma-separated list of values to replace with NA
  - ``RawLevels`` A comma-separated list of values to be replaced with
    corresponding values in ``NewLevels``.
  - ``NewLevels`` A comma-separated list of replacement values for each
    of the values listed in ``RawLevels``.

To apply these rules, pass your ``.tsv`` file(s) to ``funpack`` with the
``--datacoding_file`` option. They will be applied to all data-fields which
use the data-coding(s) listed in the file(s).


To define rules at the *data-field* level, create one or more ``.tsv`` files
with an ``ID`` column containing the data-field ID, and any of the following
columns:


  - ``NAValues``: As above
  - ``RawLevels`` As above
  - ``NewLevels`` As above
  - ``ParentValues``: A comma-separated list of expressions on parent
    data-field, defining conditions which should trigger child-value
    replacement.
  - ``ChildValues``: A comma-separated list of values to insert into the
    data-field when the corresponding expression in ``ParentValues`` evaluates
    to true.
  - ``Clean``: A comma-separated list of cleaning functions to apply to the
    data-field.


Output
------


The main output of FUNPACK is a plain-text file [*]_ which contains the input
data, after cleaning and processing, potentially with some columns removed,
and new columns added.


If you used the ``--suppress_non_numerics`` option, the main output file will
only contain the numeric columns. You can combine this with the
``--write_non_numerics`` option to save non-numeric columns to a separate
file.


You can use any tool of your choice to load this output file, such as Python,
MATLAB, or Excel. It is also possible to pass the output back into
FUNPACK.


.. [*] If your output file name ends with ``.csv``, the file will be
       comma-separated, and if your output file name ends with ``.tsv``, the
       file will be tab-separated.


Tests
-----


To run the test suite, you need to install some additional dependencies::

    pip install fmrib-unpack[test]


Then you can run the test suite using ``pytest``::

    pytest


macOS issues
------------


FUNPACK makes extensive use of the Python `multiprocessing
<https://docs.python.org/3/library/multiprocessing.html>`_ module to speed up
certain steps in its processing pipeline.  FUNPACK relies on the POSIX `fork()
<https://www.man7.org/linux/man-pages/man2/fork.2.html>`_ mechanism, so that
worker processes may inexpensively inherit the memory space of the main
process (often referred to as *copy-on-write*).  This is to avoid having to
serialise the data set being processed (stored internally as a
``pandas.DataFrame``).


In python 3.8 on macOS, the default method used by the ``multiprocessing``
module was changed from ``fork`` to ``spawn``, due to changes in macOS 10.13
restricting the use of ``fork()`` for safety reasons. Some background
information on this change can be found at https://bugs.python.org/issue33725,
and at `this blog post
<https://wefearchange.org/2018/11/forkmacos.rst.html>`_.


FUNPACK therefore explicitly sets the method used by the ``multiprocessing``
to ``fork``, to take advantage of copy-on-write semantics.  Using ``fork()``
on macOS *should* be safe for single-threaded parent processes, but as FUNPACK
calls ``fork()`` numerous times (by creating and discarding
``multiprocessing.Pool()`` objects on an as-needed basis), this assumption may
not be valid, and FUNPACK may crash with an error message resembling the
following::


    +[SomeClass initialize] may have been in progress in another thread
    when fork() was called. We cannot safely call it or ignore it in the
    fork() child process. Crashing instead.


You might be able to work around this error by setting an environment variable
before calling FUNPACK, like so::


    export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
    fmrib_unpack ...


Citing
------


If you would like to cite FUNPACK, please refer to its `Zenodo page
<https://doi.org/10.5281/zenodo.1997626>`_.
