Colibri Scripts

In this tutorial we will discuss what the general structure of a colibri fit folder is and some of the scripts that are available in colibri.

Colibri fit folders

A colibri fit folder is the folder resulting from a colibri-model fit. It is essentially a folder containing a set of relevant information for the fit. Currently, we distinguish between two types of fit folders: Bayesian fit folders and Monte Carlo replica fit folders.

Bayesian fit folders

Note

By “Bayesian fit folder” we mean a folder containing the results of a fit performed with a Bayesian sampling method (see this section for details on how to run a Bayesian fit).

Any Bayesian fit folder should contain the following files:

colibri_fit/
├── replicas/         # Folder containing replica sub‐folders (one per replica) with exportgrid files.
├── pdf_model.pkl     # pickled PDF model used for the fit
├── input/            # directory of input data and runcard(s)
├── filter.yml        # YAML file: copy of the input runcard
├── md5               # checksum file to verify integrity of the fit folder
├── bayes_metrics.csv
└── full_posterior_sample.csv

The replicas folder contains the subfolders of the replicas that were used in the fit. Each of these folders contains an .exportgrid file, which can be interpreted as a sample from the posterior distribution of the PDF model. The pdf_model.pkl file contains the pickled PDF model used for the fit. This file can be used for several purposes,an example is that of using it to resample from the posterior distribution of the PDF model when a Bayesian fit is performed (See also colibri.scripts.ns_resampler). The other files are the input data and the filter file, which is a copy of the input runcard used for the fit. The md5 file is a checksum file that can be used to verify the integrity of the fit folder. The bayes_metrics.csv file contains the metrics of the fit, such as the log-likelihood and the evidence. The full_posterior_sample.csv file contains the full posterior sample of the fit (whose size is specified in the runcard).

Depending on the type of Bayesian fit, other files may be present, for example a fit done using the ultranest will contain the following extra files:

ultranest_colibri_fit/
├── ultranest_logs/
├── ns_result.csv

While a fit done using the analytic_fit module will contain the extra following file:

analytic_colibri_fit/
├── analytic_result.csv

MC replica fit folders

Note

By “MC replica fit folder” we mean a folder containing the results of a fit performed with a Monte Carlo replica method (See [CMMM24] for more details on this method.).

A MC replica fit folder should have the following structure:

mc_replica_fit/
├── fit_replicas/     # Folder containing replica sub-folders (one per replica) with exportgrid files.
├── pdf_model.pkl     # pickled PDF model used for the fit
├── input/            # directory of input data and runcard(s)
├── filter.yml        # YAML file: copy of the input runcard
└── md5               # checksum file to verify integrity of the fit folder

where the fit_replicas folder contains the subfolders of the replicas that were used in the fit.

Finding the \(\chi^2\) of a Monte Carlo fit

The \(\chi^2\) for each replica of your Monte Carlo fit will be stored in the fit_replicas/replica_n/mc_loss.csv file, where n is the specific replica number. This file lists the training and validation losses for every 50 epochs. For example, the first few lines would look like this:

epochs,training_loss,validation_loss
0,7.80859e+00,1.13569e+01
1,6.19384e+00,9.22697e+00
2,4.86740e+00,7.44600e+00
...

which would represent the losses for the first 150 epochs (i.e. 0, 1, 2 are just labels).

Postfit selection

The fit_replicas is used by the colibri.scripts.mc_postfit script to perform a postfit selection of the replicas. The postfit script also takes care of creating the replicas folder, which is the one needed for the evolution of the fit.

You can therefore run a postfit selection of the replicas by running:

mc_postfit -c CHI2_THRESHOLD monte_carlo_output_directory

where the -c `` is optional and ``CHI2_THRESHOLD is a number that determines the \(\chi^2\) threshold above which a Monte Carlo replica will be rejected. This can also be run as --chi2_threshold instead of -c. If no value is specified, a default value of 1.5 will be applied.

Other options are:

--nsigma NSIGMA: The nsigma threshold above which replicas are rejected. The default is 5.
--target_replicas TARGET_REPLICAS or -t TARGET_REPLICAS: The target number of replicas to be produced by postfit. The default is 100.

Evolution script

The evolution script of colibri is a wrapper around the evolven3fit script (See the colibri.scripts.evolve_fit module’s and colibri.scripts.evolve_fit.main() function.) that only allows for the evolve option.

It can be executed from the command line as follows:

evolve_fit <name_fit>

where <name_fit> is the name of the fit you want to evolve. The script also has a --help option that will show you all the options available. For more information on the evolution see also the helper from the evolven3fit script.

Postfit emulation

For Bayesian fits we don’t do any postfit selection on the posterior, however, for backwards compatibility with the validphys module we still run a postfit emulation which takes care of creating the central replica and a postfit folder containing the evolved replicas as well as the corresponding LHAPDF set.

Upload of the fit

After running the evolution script, it is possible (if the user has the right permissions) to simply upload the fit to the validphys server using the validphys script

vp-upload <name_fit>

After which the fit can be installed and made available in the environment with the command

vp-get fit <name_fit>

If the user does not have the right permissions it is recommended to simply symlink the lhapdf set to the lhapdf environment folder or to symlink the fit folder to the NNPDF/results folder of the environment.

Note

The final folder after the evolution will also contain a symlink nnfit -> replicas needed for validphys and evolven3fit as well as a postfit folder.

Resampling script

In a Colibri fit runcard, you control how many posterior samples get written out as .exportgrid files in the replicas/ folder — and those can subsequently be evolved into a PDF set.

For a Bayesian fit using the analytical - inference method, set the total number of posterior draws via the analytic_settings block. For example:

# Analytic settings
analytic_settings:
  n_posterior_samples: 100
  full_sample_size: 50000

Likewise, if you instead use the UltraNest nested sampler, specify exactly the same parameter name under ultranest_settings:

# ultranest settings
ultranest_settings:
  n_posterior_samples: 100
  ...

Key Parameters

n_posterior_samples: The number of individual posterior draws that will each be written out as a separate .exportgrid file in the replicas/ folder.
full_sample_size (analytic only) : The total size of the merged posterior sample, which is saved to full_posterior_sample.csv at the top level of your fit directory.

Note

In the case of a fit done using the ultranest nested sampling sampler, the full_sample_size defaults to an internal number that might depends on the specific run.

If you want to draw additional replicas (or have a smaller set for a finite-size effects studies) from the posterior distribution of an already‐completed PDF fit, you do not need to re‐run the full fit. Instead, use the resample_fit helper script.

Usage

To see all available options, invoke:

$ resample_fit --help

This will print out a help message that looks like this:

usage: resample_fit [-h] [--fitype FITYPE] [--nreplicas NREPLICAS] [--resampling_seed RESAMPLING_SEED]
                    [--resampled_fit_name RESAMPLED_FIT_NAME] [--parametrisation_scale PARAMETRISATION_SCALE]
                    fit_name

Script to resample from Bayesian posterior

positional arguments:
  fit_name              The colibri fit from which to sample.

options:
  -h, --help            show this help message and exit
  --fitype FITYPE, -t FITYPE
                        The type of fit to be resampled. Currently only `ultranest` and `analytic` are supported.
  --nreplicas NREPLICAS, -nrep NREPLICAS
                        The number of samples.
  --resampling_seed RESAMPLING_SEED, -seed RESAMPLING_SEED
                        The random seed to be used to sample from the posterior.
  --resampled_fit_name RESAMPLED_FIT_NAME, -newfit RESAMPLED_FIT_NAME
                        The name of the resampled fit.
  --parametrisation_scale PARAMETRISATION_SCALE, -Q PARAMETRISATION_SCALE
                        The scale at which the PDFs are fitted.

As an example, if we want to resample from the posterior distribution of an analytical fit called my_fit we can do it as follows:

resample_fit my_fit -t analytic -n 100 -seed 1234 -newfit my_resampled_fit

Note

Importantly, in order to resample from the posterior distribution of a fit, you need to be in the same environment as the one used to perform the fit. Hence, if you want to resample a fit done using the les-houches PDF model, you need to be in the environment where the les_houches_exe exectuable is available.