Colibri Scripts
In this tutorial we discuss the general structure of a colibri fit folder and some of the scripts that are available in colibri.
Colibri fit folders
A colibri fit folder is the folder resulting from a colibri-model fit. It collects the inputs, outputs and metadata of the fit. Currently, we distinguish between two types of fit folders: Bayesian fit folders and Monte Carlo replica fit folders.
Bayesian fit folders
Note
By “Bayesian fit folder” we mean a folder containing the results of a fit performed with a Bayesian sampling method (see this section for details on how to run a Bayesian fit).
Any Bayesian fit folder should contain the following files:
colibri_fit/
├── replicas/ # Folder containing replica sub‐folders (one per replica) with exportgrid files.
├── pdf_model.pkl # pickled PDF model used for the fit
├── input/ # directory of input data and runcard(s)
├── filter.yml # YAML file: copy of the input runcard
├── md5 # checksum file to verify integrity of the fit folder
├── bayes_metrics.csv
└── full_posterior_sample.csv
The replicas folder contains the subfolders of the replicas that were used in the fit. Each of these folders contains an .exportgrid file, which can be interpreted as a sample from the posterior distribution of the PDF model.
The pdf_model.pkl file contains the pickled PDF model used for the fit. This file can be used for several purposes; for example, it can be used to resample from the posterior distribution of the PDF model when a Bayesian fit is performed (see also colibri.scripts.ns_resampler).
The input folder contains the input data, and filter.yml is a copy of the input runcard used for the fit.
The md5 file is a checksum file that can be used to verify the integrity of the fit folder.
The bayes_metrics.csv file contains the metrics of the fit, such as the log-likelihood and the evidence.
The full_posterior_sample.csv file contains the full posterior sample of the fit (whose size is specified in the runcard).
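The layout above can be checked programmatically. The following is a minimal sketch, not part of colibri: the helper name missing_entries and the constant BAYESIAN_FIT_ENTRIES are hypothetical, and the list of entries is simply copied from the tree shown above.

```python
import tempfile
from pathlib import Path

# Entries listed in the Bayesian fit folder tree above (hypothetical constant name).
BAYESIAN_FIT_ENTRIES = [
    "replicas",
    "pdf_model.pkl",
    "input",
    "filter.yml",
    "md5",
    "bayes_metrics.csv",
    "full_posterior_sample.csv",
]


def missing_entries(fit_dir, expected=BAYESIAN_FIT_ENTRIES):
    """Return the expected files/folders that are absent from fit_dir."""
    fit_dir = Path(fit_dir)
    return [name for name in expected if not (fit_dir / name).exists()]


# Demonstration on a throwaway folder that only contains part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    fit = Path(tmp) / "colibri_fit"
    (fit / "replicas").mkdir(parents=True)
    (fit / "input").mkdir()
    (fit / "filter.yml").write_text("# copy of the runcard\n")
    missing = missing_entries(fit)

print(missing)
```

An empty list means every entry from the tree is present; it says nothing about whether the contents are valid.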
Depending on the type of Bayesian fit, other files may be present. For example, a fit done using the ultranest sampler will contain the following extra files:
ultranest_colibri_fit/
├── ultranest_logs/
├── ns_result.csv
A fit done using the analytic_fit module will instead contain the following extra file:
analytic_colibri_fit/
├── analytic_result.csv
MC replica fit folders
Note
By “MC replica fit folder” we mean a folder containing the results of a fit performed with a Monte Carlo replica method (see [CMMM24] for more details on this method).
An MC replica fit folder should have the following structure:
mc_replica_fit/
├── fit_replicas/ # Folder containing replica sub-folders (one per replica) with exportgrid files.
├── pdf_model.pkl # pickled PDF model used for the fit
├── input/ # directory of input data and runcard(s)
├── filter.yml # YAML file: copy of the input runcard
└── md5 # checksum file to verify integrity of the fit folder
where the fit_replicas folder contains the subfolders of the replicas that were used in the fit.
Finding the \(\chi^2\) of a Monte Carlo fit
The \(\chi^2\) for each replica of your Monte Carlo fit is stored in the fit_replicas/replica_n/mc_loss.csv file, where n is the specific replica number.
This file lists the training and validation losses, recorded once every 50 epochs. For example, the first few lines would look like this:
epochs,training_loss,validation_loss
0,7.80859e+00,1.13569e+01
1,6.19384e+00,9.22697e+00
2,4.86740e+00,7.44600e+00
...
These three rows represent the losses for the first 150 epochs: the values 0, 1, 2 in the epochs column are just row labels, not epoch numbers.
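To inspect these losses outside of colibri, the file can be parsed with the Python standard library. This is a minimal sketch: the function name read_mc_loss is hypothetical, and the column names are taken from the example header above.

```python
import csv
import io


def read_mc_loss(text):
    """Parse the content of an mc_loss.csv file into a list of numeric rows."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "epochs": int(row["epochs"]),  # row label, not an epoch number
            "training_loss": float(row["training_loss"]),
            "validation_loss": float(row["validation_loss"]),
        }
        for row in reader
    ]


# The example lines shown above, used here instead of a real file.
sample = """epochs,training_loss,validation_loss
0,7.80859e+00,1.13569e+01
1,6.19384e+00,9.22697e+00
2,4.86740e+00,7.44600e+00
"""

rows = read_mc_loss(sample)
last = rows[-1]
print(last["training_loss"], last["validation_loss"])
```

For a real fit, the text would be read from fit_replicas/replica_n/mc_loss.csv, and the last row gives the most recently recorded losses for that replica.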
Postfit selection
The fit_replicas folder is used by the colibri.scripts.mc_postfit script to perform a postfit selection of the replicas. The postfit script also takes care of creating the replicas folder, which is the one needed for the evolution of the fit.
You can therefore run a postfit selection of the replicas by running:
mc_postfit -c CHI2_THRESHOLD monte_carlo_output_directory
where the -c option is optional and CHI2_THRESHOLD is a number that determines the \(\chi^2\) threshold above which a Monte Carlo replica will be rejected. This option can also be passed as --chi2_threshold instead of -c. If no value is specified, a default value of 1.5 will be applied.
Other options are:
--nsigma NSIGMA: The nsigma threshold above which replicas are rejected. The default is 5.
--target_replicas TARGET_REPLICAS or -t TARGET_REPLICAS: The target number of replicas to be produced by postfit. The default is 100.
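The effect of the \(\chi^2\) threshold and of the target number of replicas can be illustrated with a small sketch. This is not the mc_postfit implementation: the function select_replicas is hypothetical, it assumes one \(\chi^2\) value per replica is already available, it keeps replicas in order of replica number, and it does not model the nsigma criterion.

```python
def select_replicas(chi2_by_replica, chi2_threshold=1.5, target_replicas=100):
    """Keep replicas whose chi2 does not exceed the threshold, up to a target count.

    chi2_by_replica maps a replica number to its chi2 (assumed input format).
    The defaults mirror the documented defaults of -c and --target_replicas.
    """
    kept = [
        replica
        for replica, chi2 in sorted(chi2_by_replica.items())
        if chi2 <= chi2_threshold  # replicas above the threshold are rejected
    ]
    return kept[:target_replicas]


# Illustrative values only, not taken from a real fit.
chi2_values = {1: 1.2, 2: 1.7, 3: 0.9, 4: 1.5}
selected = select_replicas(chi2_values)
print(selected)
```

Which \(\chi^2\) the real script uses, and how it orders or truncates the surviving replicas, should be checked against the colibri.scripts.mc_postfit module.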
Evolution script
The evolution script of colibri is a wrapper around the evolven3fit script that only allows for the evolve option (see the colibri.scripts.evolve_fit module and its colibri.scripts.evolve_fit.main() function).
It can be executed from the command line as follows:
evolve_fit <name_fit>
where <name_fit> is the name of the fit you want to evolve.
The script also has a --help option that will show you all the options available.
For more information on the evolution, see also the help message of the evolven3fit script.
Postfit emulation
For Bayesian fits we don’t do any postfit selection on the posterior. However, for backwards compatibility with the validphys module, we still run a postfit emulation, which takes care of creating the central replica and a postfit folder containing the evolved replicas as well as the corresponding LHAPDF set.
Upload of the fit
After running the evolution script, a user with the right permissions can upload the fit to the validphys server using the validphys script
vp-upload <name_fit>
The fit can then be installed and made available in the environment with the command
vp-get fit <name_fit>
If the user does not have the right permissions, it is recommended to symlink the LHAPDF set to the LHAPDF environment folder, or to symlink the fit folder to the NNPDF/results folder of the environment.
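The symlink option can be sketched in Python as follows. The helper link_fit is hypothetical and the directories used in the demonstration are temporary stand-ins; the actual location of the NNPDF/results folder depends on your environment and must be supplied by you.

```python
import tempfile
from pathlib import Path


def link_fit(fit_dir, results_dir):
    """Create results_dir/<fit name> as a symlink pointing to fit_dir."""
    fit_dir = Path(fit_dir).resolve()
    link = Path(results_dir) / fit_dir.name
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"{link} already exists")
    link.symlink_to(fit_dir, target_is_directory=True)
    return link


# Demonstration with temporary stand-ins for the fit and results folders.
with tempfile.TemporaryDirectory() as tmp:
    fit = Path(tmp) / "my_fit"
    fit.mkdir()
    results = Path(tmp) / "results"  # stand-in for the NNPDF/results folder
    results.mkdir()
    link = link_fit(fit, results)
    linked_ok = link.is_symlink() and link.resolve() == fit.resolve()

print(linked_ok)
```

The helper refuses to overwrite an existing entry, so a previously installed fit with the same name is never replaced silently.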
Note
The final folder after the evolution will also contain a symlink nnfit -> replicas needed for validphys and evolven3fit as well as a postfit folder.
Resampling script
In a Colibri fit runcard, you control how many posterior samples get written out as .exportgrid files in the replicas/ folder; those can subsequently be evolved into a PDF set.
For a Bayesian fit using the analytic inference method, set the number of posterior draws via the analytic_settings block. For example:
# Analytic settings
analytic_settings:
  n_posterior_samples: 100
  full_sample_size: 50000
Likewise, if you instead use the UltraNest nested sampler, specify the same parameter name under ultranest_settings:
# ultranest settings
ultranest_settings:
  n_posterior_samples: 100
  ...
Key Parameters
n_posterior_samples: The number of individual posterior draws that will each be written out as a separate .exportgrid file in the replicas/ folder.
full_sample_size (analytic only): The total size of the merged posterior sample, which is saved to full_posterior_sample.csv at the top level of your fit directory.
Note
In the case of a fit done using the ultranest nested sampler, full_sample_size defaults to an internal number that might depend on the specific run.
If you want to draw additional replicas from the posterior distribution of an already-completed PDF fit (or a smaller set, for example for finite-size effect studies), you do not need to re-run the full fit.
Instead, use the resample_fit helper script.
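Conceptually, resampling amounts to drawing a reproducible subset of an existing posterior sample. The sketch below illustrates that idea only and is not what resample_fit does internally: the function draw_replicas is hypothetical, and it assumes equally weighted posterior rows drawn without replacement.

```python
import random


def draw_replicas(posterior_rows, n_replicas, seed):
    """Draw n_replicas rows without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    return rng.sample(posterior_rows, n_replicas)


# Stand-in for a posterior sample: each row is one set of model parameters.
posterior = [[0.1 * i, 1.0 - 0.01 * i] for i in range(50)]

first = draw_replicas(posterior, n_replicas=10, seed=1234)
second = draw_replicas(posterior, n_replicas=10, seed=1234)
print(first == second)
```

Fixing the seed makes the draw repeatable, which is the role the seed plays in any reproducible resampling.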
Usage
To see all available options, invoke:
$ resample_fit --help
This will print out a help message that looks like this:
usage: resample_fit [-h] [--fitype FITYPE] [--nreplicas NREPLICAS] [--resampling_seed RESAMPLING_SEED]
[--resampled_fit_name RESAMPLED_FIT_NAME] [--parametrisation_scale PARAMETRISATION_SCALE]
fit_name
Script to resample from Bayesian posterior
positional arguments:
fit_name The colibri fit from which to sample.
options:
-h, --help show this help message and exit
--fitype FITYPE, -t FITYPE
The type of fit to be resampled. Currently only `ultranest` and `analytic` are supported.
--nreplicas NREPLICAS, -nrep NREPLICAS
The number of samples.
--resampling_seed RESAMPLING_SEED, -seed RESAMPLING_SEED
The random seed to be used to sample from the posterior.
--resampled_fit_name RESAMPLED_FIT_NAME, -newfit RESAMPLED_FIT_NAME
The name of the resampled fit.
--parametrisation_scale PARAMETRISATION_SCALE, -Q PARAMETRISATION_SCALE
The scale at which the PDFs are fitted.
As an example, if we want to resample from the posterior distribution of an analytical fit called my_fit, we can do it as follows:
resample_fit my_fit -t analytic -nrep 100 -seed 1234 -newfit my_resampled_fit
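When resampling is part of a larger workflow, the command can be assembled and launched from Python. This is a minimal sketch: the helper resample_command is hypothetical, and the flag spellings are the ones listed in the help message above.

```python
import shlex
import subprocess  # only needed if the command is actually executed


def resample_command(fit_name, fit_type, nreplicas, seed, new_fit_name):
    """Build the resample_fit argument list from the documented options."""
    return [
        "resample_fit",
        fit_name,
        "-t", fit_type,
        "-nrep", str(nreplicas),
        "-seed", str(seed),
        "-newfit", new_fit_name,
    ]


cmd = resample_command("my_fit", "analytic", 100, 1234, "my_resampled_fit")
command_line = shlex.join(cmd)
print(command_line)

# To run it, colibri must be installed in the active environment:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues if a fit name contains spaces or special characters.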
Note
Importantly, in order to resample from the posterior distribution of a fit, you need to be in the same environment as the one used to perform the fit.
Hence, if you want to resample a fit done using the les-houches PDF model, you need to be in the environment where the les_houches_exe executable is available.