Open Source Development
The eScience Center provides a good guide on several aspects required for scientific open-source projects. Using this guide as a starting point, we discuss some of the points necessary to provide an easy-to-use and maintainable software project at the PTB.
For a real-world example of the points discussed below, see the following links:
- PyThia: https://gitlab1.ptb.de/pythia/pythia/
- PyThia-Doc: https://pythia-uq.readthedocs.io/en/latest/
Project Setup
Version Control
Different websites to host a Git repository:
- PTB-GitLab: https://gitlab1.ptb.de/
- GitLab: https://gitlab.com
- GitHub (Microsoft): https://github.com
- Bitbucket (Atlassian): https://bitbucket.org
Git Workflow
If a project is developed/maintained by more than one person, it is useful to agree on a development strategy (branching model) to keep the history clean and readable.
Groups: If the project is not tied to a single person, or if multiple people are responsible for maintaining the project, it might be useful to create a GitLab Group. This way the project itself is not tied to a specific user account, and additional projects may be linked under the same group as well. (See the PyThia Group as an example.)
In particular, it might be useful to create a Git group for your working group. This way you can easily grant everyone in your workgroup access to every Git repository of the workgroup. You can even define different roles, e.g., to prevent specific members (e.g. students) from pushing to certain branches (e.g. the main branch of your PhD repository).
Branching models: A Git workflow or branching model essentially fixes how new branches are created and enforces certain criteria for different branches. This way everyone working with the code, either through usage or contribution, knows how to obtain a stable version of the software, check out the latest development stage or add features in a reproducible way. There exists a multitude of common branching models, none of them inherently better than the others. For examples, see either the GitHub flow model or the branching model of PyThia. A good overview of different Git workflows can also be found on this website.
Here are some examples:
- If only you are working on your repository and you only ever add new things, you can use a single branch (main) to build your code.
- If you always want to have a working version of your code, you can work linearly on a development branch and merge it into the main branch only if everything works.
- If you work on different features simultaneously or want to test things out, you can create feature branches, code on them and merge them into the development branch once everything works in them. (See the sketch below for a typical sequence of commands.)
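A minimal sketch of such a feature-branch workflow on the command line could look like this (branch names and commit messages are only illustrative; the exact steps depend on the branching model you choose):

git checkout development               # start from the development branch
git checkout -b feature/new-solver     # create and switch to a feature branch
# ... edit, add and commit your changes ...
git push -u origin feature/new-solver  # publish the feature branch
git checkout development               # switch back once the feature is finished
git merge --squash feature/new-solver  # merge (and optionally squash) the feature
git commit -m "feat: add new solver"
git branch -d feature/new-solver       # delete the merged feature branch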
Tips:
- You can "squash" commit messages of feature branches, if you are not interested in keeping every "Bugfix" commit message in your history.
- When working with (a lot of) different branches, make sure you can easily see which branch you are on. For example, you can change the layout of your bash command line to show the current branch:
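A minimal sketch of such a prompt, assuming a bash shell (the helper function name and prompt layout are only an example):

# in ~/.bashrc
parse_git_branch() {
  # print " (branch-name)" if the current directory is inside a Git repository
  git branch 2>/dev/null | sed -n 's/^\* \(.*\)/ (\1)/p'
}
export PS1='\u@\h:\w$(parse_git_branch)\$ '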
Git Commit Messages
To keep the Git history clean and understandable, it is best practice to follow a consistent commit style guide, as e.g. the one discussed in this blog post. As an overview of the common Git commit message style, this cheat sheet might also help a lot.
General rules:
- Each commit message consists of a header, a body and a footer.
- The header is mandatory and should not exceed 50 characters.
- The body and footer are optional; however, referencing an issue with a closing keyword in the footer (e.g. "Closes #42") will close the issue automatically.
- Any line of the commit message cannot be longer than 72 characters! This allows the message to be easier to read on GitLab as well as in various git tools.
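As an illustration, a commit message following these rules could look like this (scope, wording and issue number are made up):

feat(sampler): add quasi-Monte-Carlo sampling

Add a Sobol sequence based sampler as an alternative to plain
Monte-Carlo sampling to reduce the variance of the estimates.

Closes #42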
.gitignore
To keep only the relevant information in your repository, you should specify which files should not be tracked by Git. To do so, simply add a .gitignore file to your repository root. A .gitignore file can look something like this:
# general
__pycache__/
*.pyc
.ipynb_checkpoints/
# data and image storage directories
data/
img/
# configuration file
app/config/config.ini
Important:
Do not use Jupyter notebooks to implement code that is under version control, as Git is only able to compute diffs efficiently for plain text files.
Similarly, don't add data files (.npz, .hdf5, etc.), documents (.pdf, .docx, etc.) or images (.png, .jpeg, .svg, etc.) to your Git repository.
README and other Information
Each repository should include a README.md file which will automatically be used as a start page on GitLab and other platforms.
The README.md should include information on the purpose of the repository, installation guides and other generally useful information.
GitLab in particular uses some special commands to display source code, links and images (see GitLab Flavoured Markdown).
If you make your repository publicly available, you definitely should include a LICENSE.txt to specify under which conditions your code can and should be used by others.
GitLab, GitHub and Bitbucket even give you suggestions for different license files.
If many people are contributing to your repository, you should also think about including markdown files explaining how to raise issues and which coding and commit style the repository uses. A code of conduct might also be a good idea if your project is larger and publicly visible.
Finally, if other people's work is based on your code, you should include a CHANGELOG.md file to keep every user up to date with the latest changes.
Code Release
Choose a scheme to distinguish different versions of your code, such as semantic versioning. This way users of your code know which version they are working with, which helps with the reproducibility of complex code projects later on. Best practice is to use e.g. GitLab tags to distinguish different versions of the code in the Git history. With this, it is very easy to check out a certain commit to install a specific version.
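A minimal sketch of tagging and checking out a release on the command line (the version number is only an example):

git tag -a v2.0.1 -m "Release v2.0.1"   # create an annotated tag for the current commit
git push origin v2.0.1                  # publish the tag to the remote repository
git checkout v2.0.1                     # later: check out exactly this version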
Coding (Python)
Coding Environment
Choose a coding environment you are comfortable with. Optimally, your environment supports you while coding, i.e., has features like
- code completion
- advanced file navigation (easy to jump to functions in other files)
- live code suggestions (e.g., to see function names and input types)
- linting and auto-formatting (format code automatically, warn you if inputs have the wrong type)
This can look something like this (in neovim):
Here are some suggestions for coding environments:
Tip: Try to learn at least the basic commands of vi/vim, as this editor is terminal based (no GUI required) and is preinstalled on virtually any Linux distribution. This lets you edit files on, e.g., a server via SSH. Git also uses vim as its standard editor (and less as its pager).
Code Formatting
Use a consistent coding style. Best practice is to adhere to industry standards such as the PEP-8 code conventions. The Google Python Style Guide builds on the PEP-8 standard and gives a concise explanation of good practices. Optimally, use some kind of auto-formatting tool to enforce the PEP-8 format.
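As an example, the formatter black (one of several available tools; the file and package names below are only placeholders) can be run on a single file or a whole package like this:

pip install black
black my_module.py      # reformat one file in place
black my_package/       # reformat every Python file in the package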
Get your Code to run
There are several ways to make your code available to other people and devices. Depending on the scope and goal of your project, some options may be more reasonable than others. Here are some scenarios with suggestions:
1. Writing scientific code:
If you simply write some test scripts while relying on either common packages or packages developed by other groups, you should ensure that each required package (with the respective version) is available to users.
This allows others (or yourself) to easily setup all requirements to run your code.
The easiest way to do this is either a pip requirements.txt or, if you use Anaconda, a conda environment.yml.
These files contain the package (version) information and the environment can simply be installed using one command line, e.g.,
conda env create --file environment.yml
An example environment.yml file looks like this:
name: my_env
channels:
  - conda-forge
  - defaults
dependencies:
  - ipython=8.4.*
  - matplotlib=3.5.*
  - pip=22.1.*
  - python=3.9.*
  - scipy=1.7.*
  - pip:
      - numpy==1.22.*
      - pylint==2.14.*
Tip:
Even though you can "freeze" your current environment to create a snapshot of every package currently installed, it is better to add the packages to the environment.yml
file manually.
This way you can e.g., leave some specifics out (numpy==1.22.*
) to get the latest bugfixes.
More importantly, you can specify the packages that are necessary to run your code and don't force others to install every package on your machine on theirs as well.
Tip:
If you use conda, switch to Mamba as this uses the same syntax but is a lot faster.
You can install mamba by running conda install mamba -n base -c conda-forge in your base environment.
2. Writing scientific code with your own package:
Assume you have a repository with some utility functions and another one in which you implement some application. You can (locally) import the utility package, but communicating this to others is difficult. Moreover, if you need to change something in the utility repository while working on an application, you have to manage multiple Git repositories. For this case, you can use Git submodules to integrate a snapshot of the utility repository into your application repository. Simply navigate to the submodule (sub-directory) and start editing/committing your files there. It is basically one Git repository inside another Git repository.
Tip: You should still use an environment file to track the versions of other packages. You should also specify which commit of the utility repository your application repository expects to work with (the submodule reference records exactly that).
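A minimal sketch of the corresponding commands (the repository URL and directory name are only placeholders):

git submodule add https://gitlab1.ptb.de/my-group/my-utils.git utils   # add the utility repo as a submodule
git commit -m "chore: add utils submodule"
# on another machine, clone the application including its submodules:
git clone --recurse-submodules <application-repository-url>
# update the submodule to the commit recorded in the application repository:
git submodule update --init --recursive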
3. Writing a public package:
If you are writing a library/package for others to use (like PyThia), you can use the setuptools package and a setup.py script to enable installation of your code via pip.
You can specify a version, description, author, copyright and required package versions this way, and installation becomes as easy as calling pip install . from the directory the setup.py script is located in.
This installs your package into the general environment, so that you can import the package from any location on your device.
An example setup script can look like this:
import setuptools

with open("README.md", "r") as fh:
    long_description = fh.read()

setuptools.setup(
    name="pythia-uq",
    version="2.0.0",
    author="Nando Farchmin",
    author_email="nando.farchmin@ptb.de",
    description=("Package for solving inverse problems and quantifying their "
                 + "uncertainties via general polynomial chaos."),
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://gitlab1.ptb.de/pythia/pythia",
    packages=setuptools.find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "Operating System :: OS Independent",
    ],
    install_requires=[
        "numpy>=1.20.0",
        "scipy>=1.5.0",
        "psutil>=5.0",
        "sphinx-autodoc-typehints>=1.18.1",
    ],
)
Tip:
While developing the package, don't rely on the pip installation process, as you would need to reinstall the package with pip every time you change anything.
For development, simply add the local path to the repository to your PYTHONPATH.
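A minimal sketch of how this could look in your shell configuration (the path is only a placeholder):

# e.g. in ~/.bashrc
export PYTHONPATH="${PYTHONPATH}:/path/to/my-package-repository"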
Tip:
If you want to increase accessibility even further, you can upload the package to PyPI.org.
This way everyone can install your package via the internet using pip install <package-name>, with no need to clone the repository.
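A rough sketch of the upload process using the build and twine tools (assuming you have registered a PyPI account and the metadata in setup.py is complete):

pip install build twine
python -m build       # create the source distribution and wheel in dist/
twine upload dist/*   # upload the release to PyPI (asks for your PyPI credentials)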
Code structure
You should split the code doing the actual work from the main files/applications you need to run.
Here is an example of a reasonable code package structure.
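One possible layout, assuming a small package with separate test and script directories (all directory and file names are only illustrative), is:

my_project/
├── my_package/      # the code doing the actual work
│   ├── __init__.py
│   ├── solver.py
│   └── utils.py
├── scripts/         # the main files / applications you actually run
│   └── run_experiment.py
├── tests/           # unit tests (see below)
│   └── test_solver.py
├── environment.yml
├── setup.py
└── README.md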
Testing (Python)
It is very important to ensure that your code is doing exactly what it should do, especially if you write scientific code that is very complex.
PyTest
The standard way to ensure the core functionality of your functions, classes and methods is unit testing with pytest (or any other Python testing module).
The pytest website explains the basic workings very well.
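A minimal sketch of such a unit test, assuming a hypothetical module my_package.statistics with a function weighted_mean, could look like this:

# tests/test_statistics.py
import numpy as np
import pytest

from my_package.statistics import weighted_mean  # hypothetical module and function


def test_weighted_mean_uniform_weights():
    # with uniform weights the weighted mean equals the ordinary mean
    values = np.array([1.0, 2.0, 3.0])
    weights = np.ones_like(values)
    assert weighted_mean(values, weights) == pytest.approx(2.0)

Running python -m pytest from the repository root will then collect and execute every test file matching test_*.py.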
Tip:
Unit tests cannot only check the functionality of existing code, they can guide development as well.
In test-driven development, you first specify in a test what your function or class should be able to do and only then start to write the actual implementation.
This way you can ensure that the code you write does exactly what you wanted from it in the beginning.
CI / CD
Another step would be to include CI/CD (continuous integration and continuous deployment) into your repository.
Essentially, you specify tasks that are run every time you push your changes to the repository server.
A typical application is running your unit tests to ensure that your edits did not change the core functionality of your code, but you can also do other things such as building the documentation automatically or updating webpages.
The setup is very easy and basically built into GitLab, GitHub and Bitbucket.
You simply need to include a .gitlab-ci.yml file (for GitLab) in your repository root directory, which looks something like this:
stages:
  - build
  - test
  - deploy

before_script:
  - export HTTPS_PROXY="webproxy.bs.ptb.de:8080"

build-job: # check if installation of pythia works
  stage: build
  image: python:3.8
  script:
    - pip install .

unit-test-job: # run unit tests
  stage: test
  image: python:3.8
  script:
    - pip install pytest pytest-cov
    - pip install .
    - python -m pytest --cov-report=html --cov=pythia .
  artifacts:
    paths:
      - coverage
    expire_in: 30 days

deploy-job:
  stage: deploy
  script:
    - echo "Not implemented yet."
and you're done. If multiple people work on the same repository (e.g., students or other group members), you can require that merge requests are only merged if the pipeline succeeds. This way nobody can merge code that breaks the tested functionality.
Documentation (Python)
Doc-Strings
Always use doc-strings to document functions, classes and modules. To be able to generate a documentation of your code automatically, adhere to doc-string standards such as the Numpy Doc Style Guide. The Numpy style guide in particular is very suitable for documenting scientific code. A practical example of the Numpy Doc style can be found here.
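A minimal sketch of a Numpy-style doc-string, for a hypothetical function weighted_mean, could look like this:

import numpy as np


def weighted_mean(values: np.ndarray, weights: np.ndarray) -> float:
    """Compute the weighted mean of an array.

    Parameters
    ----------
    values : np.ndarray
        One-dimensional array of sample values.
    weights : np.ndarray
        Non-negative weights with the same shape as values.

    Returns
    -------
    float
        Weighted mean of the given values.
    """
    return float(np.sum(weights * values) / np.sum(weights))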
Tip:
Using doc-strings can also help you directly in your IDE, which can, e.g., display the documentation of a function while you are typing a call to it.
Auto-Doc with Sphinx
If you plan to write a software package that should be used by others, or if you need to provide a documentation of your code as a deliverable for a project, you should think about creating the documentation automatically. This way the documentation is always up to date with your code without any additional manual work.
Setting up an auto-doc with, for example, Sphinx can be a little tricky at first, but in principle it is straightforward. If you want to get started, simply read the Sphinx quickstart guide.
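As a rough sketch, a typical setup (assuming separate source and build directories and a package called my_package, both of which are only examples) consists of the following steps:

pip install sphinx
sphinx-quickstart docs                     # interactive setup; answer "yes" to separate source and build directories
sphinx-apidoc -o docs/source my_package    # generate .rst stubs from the doc-strings of the package
sphinx-build -b html docs/source docs/build/html   # build the HTML documentation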
This is a very nice feature, as you can not only produce a documentation based on your doc-strings, but can also include additional pages with tutorials or a description of the setup of your project. For a demo of an auto-doc, you can check out the PyThia documentation. Of course you can also look into the doc source files of PyThia, as they are tracked in the repository as well.
Host Documentation online
Being able to create the documentation as HTML or PDF locally is nice, but directly hosting the documentation online for everyone to access is far better. As long as you're writing open-source non-profit code (which you should in science!), readthedocs.org can host your documentation online for free. You can create an account there and simply link to your repository. Read the Docs will even create different versions of the documentation based on the commits of your Git repository, i.e., a "stable" and a "latest" version (you can specify which branches these correspond to) as well as versions for different tags (v1.2.17, v2.0.1) of your project. This way, even if somebody uses an older version of your code, they can still access the matching documentation.
All you need to do from the Git side is include a .readthedocs.yml file in your repository root that looks something like this:
# Required
version: 2

# Set the version of Python and other tools you might need
build:
  os: ubuntu-20.04
  tools:
    python: "3.8"

# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/source/conf.py

# If using Sphinx, optionally build your docs in additional formats such as PDF
formats:
  - pdf
  - epub

# Optionally declare the Python requirements required to build your docs
python:
  install:
    - method: setuptools
      path: .
Tip:
Using Read the Docs with the PTB GitLab (gitlab1.ptb.de) is not possible (as far as we know) due to the firewall/proxy settings.
A workaround is mirroring the repository on, e.g., gitlab.com.
Setting up a unidirectional mirror allows gitlab.com to automatically pull updates from the original repository on the PTB GitLab approximately every 30 minutes.
Then you can use this mirror repository to create the documentation on Read the Docs.
It might be wise to disable CI/CD in the mirror repository, as CI/CD minutes are limited for free users.