Colibri fit folders

A Colibri fit folder is the output of a Colibri fit: a folder collecting the files that describe the fit and its results. Currently, we distinguish between two types of fit folders: Bayesian fit folders and Monte Carlo replica fit folders.

Bayesian fit folders

Note

By “Bayesian fit folder” we mean a folder containing the results of a fit performed with a Bayesian sampling method (see this section for details on how to run a Bayesian fit).

Any Bayesian fit folder should contain the following files:

colibri_fit/
 ├── bayes_metrics.csv
 ├── filter.yml                  # YAML file: copy of the input runcard
 ├── full_posterior_sample.csv
 ├── input/                      # directory of input data and runcard(s)
 ├── md5                         # checksum file to verify integrity of the fit folder
 ├── pdf_model.pkl               # pickled PDF model used for the fit
 └── replicas/                   # Folder containing replica sub-folders (one per replica) with exportgrid files.

The replicas folder contains one subfolder per replica used in the fit. Each subfolder contains an .exportgrid file, which can be interpreted as a sample from the posterior distribution of the PDF model. The pdf_model.pkl file contains the pickled PDF model used for the fit; it can be used, for example, to resample from the posterior distribution of the PDF model after a Bayesian fit (see also colibri.scripts.ns_resampler). The input/ directory contains the input data, and the filter.yml file is a copy of the input runcard used for the fit. The md5 file is a checksum file that can be used to verify the integrity of the fit folder. The bayes_metrics.csv file contains the metrics of the fit, such as the log-likelihood and the evidence. The full_posterior_sample.csv file contains the full posterior sample of the fit (whose size is specified in the runcard).

Depending on the type of Bayesian fit, other files may be present. For example, a Bayesian fit using UltraNest will also contain the following files if sampler_plot is set to true:

ultranest_colibri_fit/
├── ultranest_logs/
├── ns_result.csv

A fit done using the analytic_fit module will instead contain the following extra file:

analytic_colibri_fit/
├── analytic_result.csv

Finding the \(\chi^2\) of a Bayesian fit

The \(\chi^2\) for a Bayesian fit is stored in the bayes_metrics.csv file, which looks like this:

bayes_complexity,avg_chi2,min_chi2,logz
6.693346300122812,3633.618330629202,3.62692e+03,-1.83561e+03
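The metrics can be read programmatically with the Python standard library. The following is a minimal sketch, not part of Colibri: the helper name read_bayes_metrics is hypothetical, and the sample string simply reproduces the example row shown above. In practice you would open the bayes_metrics.csv file of your own fit folder instead of the in-memory string.

```python
import csv
import io

# Sample contents copied from the example above; in practice, open
# "<fit folder>/bayes_metrics.csv" instead of this in-memory string.
SAMPLE_METRICS = """bayes_complexity,avg_chi2,min_chi2,logz
6.693346300122812,3633.618330629202,3.62692e+03,-1.83561e+03
"""


def read_bayes_metrics(handle):
    """Return the single metrics row as a dict mapping column name to float.

    Hypothetical helper for illustration; it assumes the file has one
    header line followed by one data row, as in the example above.
    """
    row = next(csv.DictReader(handle))
    return {name: float(value) for name, value in row.items()}


metrics = read_bayes_metrics(io.StringIO(SAMPLE_METRICS))
print("average chi2:", metrics["avg_chi2"])
print("minimum chi2:", metrics["min_chi2"])
```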

After running a Bayesian fit, you should evolve it as described in Evolution script.

MC replica fit folders

Note

By “MC replica fit folder” we mean a folder containing the results of a fit performed with a Monte Carlo replica method (see [CMMM24] for more details on this method).

An MC replica fit folder should have the following structure:

mc_replica_fit/
 ├── filter.yml           # YAML file: copy of the input runcard
 ├── fit_replicas/        # Folder containing replica sub-folders (one per replica) with exportgrid files.
 ├── input/               # directory of input data and runcard(s)
 ├── md5                  # checksum file to verify integrity of the fit folder
 └── pdf_model.pkl        # pickled PDF model used for the fit

where the fit_replicas folder contains the subfolders of the replicas that were used in the fit. The other files/folders are analogous to the ones produced by a Bayesian fit, discussed above.
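As a quick sanity check, the two layouts listed on this page can be compared against a folder on disk. This is a minimal sketch and not a Colibri utility: the names BAYESIAN_ENTRIES, MC_ENTRIES and missing_entries are hypothetical, and the expected entries are simply those shown in the two listings above. It only checks that each entry exists; it does not verify the md5 checksum.

```python
import tempfile
from pathlib import Path

# Expected top-level entries, copied from the listings on this page.
BAYESIAN_ENTRIES = [
    "bayes_metrics.csv",
    "filter.yml",
    "full_posterior_sample.csv",
    "input",
    "md5",
    "pdf_model.pkl",
    "replicas",
]
MC_ENTRIES = ["filter.yml", "fit_replicas", "input", "md5", "pdf_model.pkl"]


def missing_entries(fit_folder, expected):
    """Return the expected entries that do not exist in fit_folder."""
    root = Path(fit_folder)
    return [name for name in expected if not (root / name).exists()]


# Demonstration on a temporary, deliberately incomplete MC fit folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "filter.yml").touch()
    (root / "fit_replicas").mkdir()
    print(missing_entries(root, MC_ENTRIES))  # ['input', 'md5', 'pdf_model.pkl']
```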

Finding the \(\chi^2\) of a Monte Carlo fit

The \(\chi^2\) for each replica of an MC fit is stored in the fit_replicas/replica_n/mc_loss.csv file, where n is the specific replica number. This file lists the training and validation losses every 50 epochs. For example, the first few lines would look like this:

epochs,training_loss,validation_loss
0,7.80859e+00,1.13569e+01
1,6.19384e+00,9.22697e+00
2,4.86740e+00,7.44600e+00
...

which would represent the losses for the first 150 epochs: the values 0, 1, 2 in the epochs column are row labels (one row per 50 epochs), not epoch numbers.
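The final losses of a replica are the values in the last row of this file. The sketch below shows one way to extract them with the standard library; the helper name final_losses is hypothetical, and the sample string reproduces only the three example rows shown above (a real mc_loss.csv will contain more rows).

```python
import csv
import io

# The three example rows shown above; in practice, open
# "fit_replicas/replica_n/mc_loss.csv" for the replica of interest.
SAMPLE_LOSS = """epochs,training_loss,validation_loss
0,7.80859e+00,1.13569e+01
1,6.19384e+00,9.22697e+00
2,4.86740e+00,7.44600e+00
"""


def final_losses(handle):
    """Return (training_loss, validation_loss) from the last data row.

    Hypothetical helper for illustration; it assumes the column names
    shown in the example above.
    """
    rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError("loss file contains no data rows")
    last = rows[-1]
    return float(last["training_loss"]), float(last["validation_loss"])


training, validation = final_losses(io.StringIO(SAMPLE_LOSS))
print("final training loss:", training)
print("final validation loss:", validation)
```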

Postfit selection

After running an MC fit, you should run a postfit selection of the replicas. This is done by the colibri.scripts.mc_postfit script, which reads the fit_replicas folder and creates a new replicas folder containing the replicas that pass the postfit. These are the replicas used to evolve the fit.

You can run a postfit selection by running:

mc_postfit -c CHI2_THRESHOLD monte_carlo_output_directory

where -c is optional and CHI2_THRESHOLD is the \(\chi^2\) value above which an MC replica is rejected. The \(\chi^2\) of a replica is taken from the last row of the training_loss column shown above. The option can also be written as --chi2_threshold instead of -c. If no value is specified, a default value of 1.5 is applied.

Other options are:

--nsigma NSIGMA: The nsigma threshold above which replicas are rejected. The default is 5.
--target_replicas TARGET_REPLICAS or -t TARGET_REPLICAS: The target number of replicas to be produced by postfit. The default is 100.
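To illustrate the \(\chi^2\) threshold criterion only, here is a minimal sketch. It is not the mc_postfit implementation: the function name passes_chi2_threshold and the loss values are made up for illustration, and the nsigma and target_replicas options are deliberately left out. It assumes, following the description above, that a replica is rejected when its final training loss lies above the threshold.

```python
def passes_chi2_threshold(final_training_loss, chi2_threshold=1.5):
    """Return True if the replica is kept under the chi2 criterion.

    Illustrative only: a replica is rejected when its final training
    loss is above the threshold (default 1.5, as stated above).
    """
    return final_training_loss <= chi2_threshold


# Made-up final training losses, keyed by replica number.
final_training_losses = {1: 1.12, 2: 1.49, 3: 2.31}

kept = sorted(
    replica
    for replica, loss in final_training_losses.items()
    if passes_chi2_threshold(loss)
)
print(kept)  # [1, 2]
```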

After running a postfit selection of an MC fit, you should evolve it as described in this section.