# Architecture of repo2docker

This is a living document describing the architecture of repo2docker from various perspectives.

## Buildpack

The **buildpack** concept comes from [Heroku](https://devcenter.heroku.com/articles/buildpacks) and Ruby on Rails' [Convention over Configuration](http://rubyonrails.org/doctrine/#convention-over-configuration) doctrine. Instead of writing a complete specification of exactly how they want their environment to be, users can focus only on how their environment differs from a conventional environment. This means that instead of deciding 'should I get Python from Apt, pyenv, or something else?', the user can just specify 'I want python-3.6'. Usually, specifying a **runtime** and a list of **libraries** with explicit **versions** is all that is needed.

In repo2docker, a Buildpack does the following things:

1. **Detect** if it can handle a given repository
2. **Build** a base language environment in the docker image
3. **Copy** the contents of the repository into the docker image
4. **Assemble** a specific environment in the docker image based on repository contents
5. **Push** the built docker image to a specific docker registry (optional)
6. **Run** the built docker image as a docker container (optional)

### Detect

When given a repository, repo2docker first has to determine which buildpack to use. It takes the following steps to determine this:

1. Look at the ordered list of `BuildPack` objects in the `Repo2Docker.buildpacks` traitlet. This is populated with a default set of buildpacks in most-specific-to-least-specific order. Other applications built on repo2docker can add to or change this list using standard [traitlet](http://traitlets.readthedocs.io/en/stable/) configuration mechanisms.
2. Call the `detect` method of each `BuildPack` object. This method assumes that the repository is present in the current working directory, and should return `True` if the buildpack can handle the repository. For example, a `BuildPack` that uses `conda` to install libraries can check for the presence of an `environment.yml` file and say 'yes, I can handle this repository' by returning `True`. Usually buildpacks look for the presence of specific files (`requirements.txt`, `environment.yml`, `install.R`, `manifest.xml`, etc.) to determine whether they can handle a repository. Buildpacks may also look *into* specific files to determine specifics of the required environment, such as the Stencila integration, which extracts the required language-specific execution contexts from an XML file (see the base `BuildPack`). More than one buildpack may use such information, as properties can be inherited (e.g. the R buildpack uses the list of required Stencila contexts to see if R must be installed).
3. If no `BuildPack` returns `True`, repo2docker uses the default `BuildPack` (defined in the `Repo2Docker.default_buildpack` traitlet).
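As a rough illustration of the detection step, here is a minimal, hypothetical sketch. The class names and the selection helper are invented for this example and are much simpler than the real classes in `repo2docker/buildpacks/` (which, for instance, also honour configuration files placed in a `binder/` sub-directory).

```python
import os


class CondaDetectExample:
    """Hypothetical, heavily simplified buildpack used only to illustrate detection.

    repo2docker calls detect() with the repository checked out as the current
    working directory, so a plain file-existence check is usually enough.
    """

    def detect(self):
        # 'Yes, I can handle this repository' if an environment.yml is present.
        return os.path.exists("environment.yml")


class DefaultDetectExample:
    """Stand-in for the fallback buildpack: it accepts any repository."""

    def detect(self):
        return True


def choose_buildpack(buildpacks, default):
    # Mirrors the selection loop described above: buildpacks are tried in
    # most-specific-to-least-specific order, first match wins, otherwise the
    # default buildpack is used.
    for buildpack in buildpacks:
        if buildpack.detect():
            return buildpack
    return default
```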
### Build base environment

Once a buildpack is chosen, it builds a **base environment** that is mostly the same for all repositories built with the same buildpack. For example, in `CondaBuildPack`, the base environment consists of installing [miniconda](https://conda.io/miniconda.html) and basic notebook packages (from `repo2docker/buildpacks/conda/environment.yml`). This is going to be the same for most repositories built with `CondaBuildPack`, so we want to use [docker layer caching](https://thenewstack.io/understanding-the-docker-cache-for-faster-builds/) as much as possible for performance reasons. The next time a repository is built with `CondaBuildPack`, we can skip straight to the **copy** step (since the base environment docker image *layers* have already been built and cached).

The `get_build_scripts` and `get_build_script_files` methods are primarily used for this. `get_build_scripts` can return arbitrary bash script lines that can be run as different users, and `get_build_script_files` is used to copy specific scripts (such as a conda installer) into the image to be run as part of `get_build_scripts`. Code in either has the following constraints:

1. You can *not* use the contents of the repository in them, since this happens before the repository is copied into the image. For example, `pip install -r requirements.txt` will not work, since there is no `requirements.txt` inside the image at this point. This is an explicit design decision, to enable better layer caching.
2. You *may*, however, read the contents of the repository and modify the scripts emitted based on that! For example, in `CondaBuildPack`, if Python 2 is specified in `environment.yml`, a different kind of environment is set up. The reading of `environment.yml` is performed in the BuildPack itself, not in the scripts returned by `get_build_scripts`. This is fine. BuildPack authors should still try to minimize the variants created in this fashion, to optimize the build cache.

### Copy repository contents

The contents of the repository are copied unconditionally into the Docker image, and made available for all further commands. This is common to most `BuildPack`s, and the code is in the `build` method of the `BuildPack` base class.

### Assemble repository environment

The **assemble** stage builds the specific environment that is requested by the repository. This usually means installing required libraries specified in a format native to the language (`requirements.txt`, `environment.yml`, `REQUIRE`, `install.R`, etc.).

Most of this work is done in the `get_assemble_scripts` method. It can return arbitrary bash script lines that can be run as different users, and it has access to the repository contents (unlike `get_build_scripts`). The docker image layers produced by this usually can not be cached, so fewer restrictions apply here than to `get_build_scripts`. (A short sketch contrasting `get_build_scripts` and `get_assemble_scripts` appears at the end of this document.)

At the end of the assemble step, the docker image is ready to be used in various ways!

### Push

Optionally, repo2docker can **push** a built image to a [docker registry](https://docs.docker.com/registry/). This is done as a convenience only (since you can do the same with a `docker push` after using repo2docker only to build), and is implemented in the `Repo2Docker.push` method. It is only activated when the `--push` command-line flag is used.

### Run

Optionally, repo2docker can **run** the built image and allow the user to access the Jupyter Notebook running inside it by default. This is also done as a convenience only (since you can do the same with `docker run` after using repo2docker only to build), and is implemented in `Repo2Docker.run`. It is activated by default unless the `--no-run` command-line flag is passed.
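Returning to the build and assemble hooks described above, here is a minimal sketch of how a buildpack might implement them. The class name, the specific install commands, and the `${NB_USER}` placeholder are illustrative assumptions rather than the verbatim repo2docker API; the return format (a list of `(user, bash-snippet)` tuples) simply follows the "script lines run as different users" description above.

```python
import os


class ExampleBuildPack:
    """Hypothetical buildpack sketch illustrating the build vs. assemble split."""

    def get_build_scripts(self):
        # Runs *before* the repository is copied into the image, so nothing
        # here may rely on repository files being present inside the image.
        # Keeping this part repository-independent is what makes these docker
        # layers cacheable across builds.
        return [
            ("root", "apt-get update && apt-get install --yes --no-install-recommends curl"),
        ]

    def get_assemble_scripts(self):
        # Runs *after* the repository has been copied into the image, so the
        # emitted script lines may freely reference repository files.
        scripts = []
        if os.path.exists("requirements.txt"):
            # The repository is checked out as the current working directory
            # on the host, so we can inspect it here to decide what to emit.
            scripts.append(
                ("${NB_USER}", "pip install --no-cache-dir -r requirements.txt")
            )
        return scripts
```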