{% extends "base.html" %} {% block title %} About the Knowledge Repo {% endblock %} {% block content %} {{ super() }}

The Knowledge Repository

Introduction

The knowledge repository is a git repository, web app, and set of tools that enable the sharing of knowledge between data scientists and people in other technical roles. It acts as a hub where people can submit their work and have it go through a standard QA review process, using data formats that make sense in these professions. Currently, the knowledge repository supports IPython notebooks (ipynb), R Markdown files (Rmd), and Markdown files (md).

Users add these files to the knowledge repository through the knowledge_repo tool, as described below, which converts them into a standard format and allows them to be rendered and curated in the web app.

Getting started

Installation

To install the knowledge repository tooling, simply run:

pip install git+ssh://git@github.com/airbnb/knowledge-repo.git

Setup

If your organization already has a knowledge data repository set up, check it out onto your computer as you normally would; for example:

git clone git@example.com:example_data_repo.git

If not, or for fun, you can create a new knowledge repository using:

knowledge_repo --repo <repo_path> init

Running this command when a repo already exists at <repo_path> has no effect.

You can drop the --repo option if you set the $KNOWLEDGE_REPO environment variable to the location of that repository.
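The lookup described above can be pictured with a minimal sketch. The function name resolve_repo_path is hypothetical, and the assumption that an explicit --repo value wins over $KNOWLEDGE_REPO is ours rather than something this page states.

```python
import os


def resolve_repo_path(cli_repo=None, environ=None):
    """Pick the repository path from --repo, else from $KNOWLEDGE_REPO.

    Hypothetical helper: the precedence (explicit option first) is an
    assumption, not documented behaviour of the knowledge_repo tool.
    """
    environ = os.environ if environ is None else environ
    if cli_repo:
        return cli_repo
    env_repo = environ.get("KNOWLEDGE_REPO")
    if env_repo:
        return env_repo
    raise ValueError("pass --repo or set $KNOWLEDGE_REPO")
```

Passing the environment as an argument keeps the helper easy to try out without touching your real shell settings.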

For more details about the structure of a knowledge repository, see the technical details section below.

Adding knowledge

The whole point of a knowledge repository is to host knowledge posts. You can add a knowledge post using:

knowledge_repo --repo <repo_path> add <supported knowledge format> <location in knowledge repo>

For example, if my knowledge repository is in a folder named test_repo, and I have an IPython notebook at Documents/notebook.ipynb, and I want it to be added to the knowledge repository under projects/test_knowledge, I can run:

knowledge_repo --repo test_repo add Documents/notebook.ipynb projects/test_knowledge

If you look in test_repo you will see a new folder test_repo/projects/test_knowledge.kp, and you are set to use git commit and git push to submit it for review. Note that the folder ends in ‘.kp’. This suffix is added automatically to indicate that the folder is a knowledge post, so explicitly adding ‘.kp’ yourself is optional. Also note that knowledge_repo does not automatically create a branch for you. If your organisation works with branches, be sure to create one manually before pushing into the repo.
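As an illustration of the ‘.kp’ naming rule, the sketch below computes the folder a post would end up in. The helper post_path is hypothetical and only mirrors the behaviour described above; it is not the tool's own code.

```python
import posixpath


def post_path(repo, location):
    """Return the folder for a knowledge post, adding '.kp' if it is missing.

    Hypothetical helper that mirrors the naming rule described in the text.
    """
    if not location.endswith(".kp"):
        location += ".kp"
    return posixpath.join(repo, location)


# Both spellings of the destination lead to the same folder.
print(post_path("test_repo", "projects/test_knowledge"))
print(post_path("test_repo", "projects/test_knowledge.kp"))
```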

Currently ‘ipynb’, ‘Rmd’ and ‘md’ files are supported. See the “Contributing” section below to see how to add support for more formats.

To update an existing knowledge post, pass the --update option, which allows the add operation to overwrite existing knowledge posts. For example:

knowledge_repo --repo <repo_path> add --update <supported knowledge format> <location in knowledge repo>
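The effect of --update can be sketched as a guard on the destination folder. This is a simplified model under the assumption that add refuses to touch an existing post unless --update is given; check_destination is a hypothetical name, not part of the tool.

```python
import os


def check_destination(post_dir, update=False):
    """Return post_dir if it may be written, else raise FileExistsError.

    Assumed behaviour: an existing post is only replaced when update=True.
    """
    if os.path.exists(post_dir) and not update:
        raise FileExistsError(
            "%s already exists; pass --update to overwrite it" % post_dir
        )
    return post_dir
```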

Running the web app

Running the web app allows you to view all the knowledge posts in the repository locally, or to serve them for others to view. It is also useful when developing the web app itself.

Running the development server

Running the web app in development/local/private mode is as simple as running:

knowledge_repo --repo <repo_path> runserver

Supported options are --port and --dburi, which respectively change the local port on which the server runs and the SQLAlchemy URI where the database can be found and/or initialised. The default port is 7000, and the default dburi is sqlite:////tmp/knowledge.db. If the database does not exist, it is created (if possible) and initialised. Database migrations are not automatic (to prevent accidental data loss), but can be performed using:

knowledge_repo --repo <repo_path> db_migrate --dburi <db>
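The four slashes in the default dburi are easy to mistype. In SQLAlchemy's SQLite URI form, the prefix sqlite:/// is followed by the file path, so an absolute path contributes a fourth slash. The helper below is a hypothetical convenience for building a value to pass to --dburi.

```python
def sqlite_uri(db_path):
    """Build a SQLAlchemy-style SQLite URI for a database file path.

    Hypothetical helper: 'sqlite:///' plus the path, so an absolute POSIX
    path such as /tmp/knowledge.db yields four slashes in total.
    """
    return "sqlite:///" + db_path


print(sqlite_uri("/tmp/knowledge.db"))  # sqlite:////tmp/knowledge.db
print(sqlite_uri("knowledge.db"))       # sqlite:///knowledge.db
```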

Deploying the web app

Deploying the web app is much like running the development server, except that the web app is deployed on top of gunicorn. It also allows for enabling server-side components such as sending emails to subscribed users.

Deploying is as simple as:

knowledge_repo --repo <repo_path> deploy

Supported options are --port, --dburi, --workers, --timeout and --config. The --config option allows you to specify a Python config file from which to load the extended configuration. A template config file is provided in resources/server_config.py. The --port and --dburi options are as before. The --workers and --timeout options specify the number of workers to use when serving through gunicorn and the timeout after which a worker is presumed to have died and is restarted.
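If you launch the deployment from a script, it can help to assemble the command as a list. The sketch below uses only the options named above; deploy_command is a hypothetical helper and the example values (such as four workers) are arbitrary.

```python
def deploy_command(repo, port=None, dburi=None, workers=None,
                   timeout=None, config=None):
    """Assemble a `knowledge_repo ... deploy` command line as a list.

    Hypothetical helper: options left as None are simply omitted.
    """
    cmd = ["knowledge_repo", "--repo", repo, "deploy"]
    options = [
        ("--port", port),
        ("--dburi", dburi),
        ("--workers", workers),
        ("--timeout", timeout),
        ("--config", config),
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


print(deploy_command("test_repo", port=7000, workers=4))
```

The resulting list can be handed to subprocess.run on a machine where knowledge_repo is installed.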

Contributing

We would love to work with you to create the best knowledge repository software possible. If you have ideas or would like to have your own code included, add an issue or pull request and we will review it.

Adding new filetype support

Support for conversion of a particular filetype to a knowledge post is added by writing a new KnowledgePostConverter subclass. Each converter should live in its own file in knowledge_repo/converters. Refer to the implementations for ipynb, Rmd, and md for more details. If your conversion is site-specific, you can define these subclasses in .knowledge_repo_config, whereupon they will be picked up by the conversion code.
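To give a feel for the pattern, here is a self-contained sketch of an extension-keyed converter registry. Every name in it (BaseConverter, register, converter_for, the txt converter) is hypothetical; it is a stand-in for the idea, not the project's KnowledgePostConverter API, so consult the existing converters for the real interface.

```python
import os

CONVERTERS = {}  # maps a file extension to a converter class


def register(*extensions):
    """Class decorator that files a converter under its extensions."""
    def decorator(cls):
        for ext in extensions:
            CONVERTERS[ext] = cls
        return cls
    return decorator


class BaseConverter:
    """Stand-in base class; NOT the project's KnowledgePostConverter."""

    def convert(self, text):
        raise NotImplementedError


@register("md")
class MarkdownConverter(BaseConverter):
    def convert(self, text):
        return text  # Markdown is assumed to need no conversion here


@register("txt")
class PlainTextConverter(BaseConverter):
    def convert(self, text):
        return text.strip() + "\n"  # tidy surrounding whitespace


def converter_for(filename):
    """Instantiate the converter registered for the file's extension."""
    ext = os.path.splitext(filename)[1].lstrip(".")
    try:
        return CONVERTERS[ext]()
    except KeyError:
        raise ValueError("no converter for '.%s' files" % ext) from None
```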

Adding extra structure and/or verifications to the knowledge post conversion process

When a KnowledgePost is constructed by converting from supported filetypes, the resulting post is then passed through a series of postprocessors (defined in knowledge_repo/postprocessors). This allows one to modify the knowledge post, upload images to remote storage facilities (such as S3), and/or verify some additional structure of the knowledge posts. As above, defining these classes in .knowledge_repo_config allows for postprocessors to be used locally.
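The chaining of postprocessors can be pictured as follows. This sketch models a post as a plain dict and postprocessors as functions, both of which are simplifying assumptions; the project's own postprocessors are classes with their own interface.

```python
def run_postprocessors(post, postprocessors):
    """Pass a post through each postprocessor in order."""
    for process in postprocessors:
        post = process(post)
    return post


def require_title(post):
    """Verification step: reject posts that lack a title."""
    if not post.get("title"):
        raise ValueError("knowledge post needs a title")
    return post


def strip_trailing_whitespace(post):
    """Modification step: tidy the end of each line in the body."""
    body = "\n".join(line.rstrip() for line in post["body"].splitlines())
    return dict(post, body=body)


# Hypothetical post used only for illustration.
post = {"title": "Test knowledge", "body": "first line  \nsecond line "}
cleaned = run_postprocessors(post, [require_title, strip_trailing_whitespace])
```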

More

Is the Knowledge Repository missing something else that you would like to see? Let us know, and we’ll see if we can help.

Technical Details

What is a Knowledge Repository

A knowledge repository is a git repository with the following structure:

<repo>
        + .git  # The git repository metadata
        + .resources  # A folder into which the knowledge_repo repository is checked out (as a git submodule)
        - .knowledge_repo_config  # Local configuration for this knowledge repository
        - <knowledge posts>
    

The use of a git submodule to check out the knowledge_repo into .resources allows us to ensure that the client and server are using the same version of the code. When one uses the knowledge_repo script, it actually passes the options to the version of the knowledge_repo script in .resources/scripts/knowledge_repo. Thus, updating the version of knowledge_repo used by client and server alike is as simple as changing which revision is checked out by git submodule in the usual way. That is:

pushd .resources
git pull
git checkout <revision>/<branch>
popd
git commit -a -m 'Updated version of the knowledge_repo'
git push

Then, all users and servers associated with this repository will be updated to the new version. This prevents version mismatches between client and server, and among the users of the repository.

In development, it is often useful to disable this chaining. To use the local code instead of the code in the checked out knowledge repository, pass the --dev option as:

knowledge_repo --repo <repo_path> --dev <action> ...

What is a Knowledge Post?

A knowledge post is a directory, with the following structure:

<knowledge_post>
        - knowledge.md
        + images/* [Optional]
        + orig_src/* [Optional; stores the original converted file]
    

Images are automatically extracted from the local paths on your computer and placed into images. orig_src contains the file(s) from which the knowledge post was converted.
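The layout above is simple enough to check by hand. As a sketch, the hypothetical helper below confirms that a folder contains knowledge.md and reports which of the optional subfolders are present.

```python
import os


def describe_post(post_dir):
    """Check a knowledge post folder and report its optional parts.

    Hypothetical helper based only on the layout described above.
    """
    if not os.path.isfile(os.path.join(post_dir, "knowledge.md")):
        raise ValueError("%s has no knowledge.md" % post_dir)
    return {
        name: os.path.isdir(os.path.join(post_dir, name))
        for name in ("images", "orig_src")
    }
```

For a post that has images but no stored original, the result would be {"images": True, "orig_src": False}.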


{% endblock %} {% block scripts %} {{ super() }} {% endblock %}