Selections

This section gives more details on how to run basic selections with different selectors using a unified input file.

The related commands are

# gdp -h for more info
$ gdp -h

# - results are written to the current directory
$ gdp select ./selection.yaml -s structures.xyz

# - if the -d option is used, results are written to that directory
$ gdp -d results select ./selection.yaml -s structures.xyz

# - accelerate the selection using 8 parallel processes,
#   which is useful for descriptor-based selection as it requires
#   massive computations
$ gdp -nj 8 -d results select ./selection.yaml -s structures.xyz

The selection configuration (./selection.yaml) is organised as

selection:
  - method: property
    ... # define property and sparsification
    number: [512, 0.5]
  - method: descriptor
    ... # define descriptor and sparsification
    number: [128, 1.0]

This selection defines a sequential procedure that consists of two selectors. Input structures are first selected based on the property, and the selected structures are then further selected based on the descriptor.

For most selectors, the parameter number is required, as it determines the number of selected structures. The first value is a fixed number and the second value is a percentage, given as a fraction of the input structures. If the input dataset has 500 structures, then with number: [512, 0.5], 250 structures will be selected (500*0.5=250, as 500 < 512). Then, with number: [128, 1.0], 128 structures will be selected (as 250 > 128).
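The rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described in this section, not the gdp implementation; the function name resolve_number is hypothetical, and truncating the fractional result to an integer is an assumption.

```python
def resolve_number(n_input, number):
    """Return how many structures a selector keeps, following the rule described above.

    number is a pair (fixed, fraction), as in `number: [512, 0.5]`.
    """
    fixed, fraction = number
    if n_input < fixed:
        # Fewer inputs than the fixed target: fall back to the fraction.
        # Truncation to an integer is an assumption of this sketch.
        return int(n_input * fraction)
    return fixed


# Chain the two selectors from the example configuration.
n_after_property = resolve_number(500, (512, 0.5))
n_after_descriptor = resolve_number(n_after_property, (128, 1.0))
print(n_after_property, n_after_descriptor)  # 250 128
```

Because the selectors run sequentially, the output count of one selector is the input count of the next.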

After a successful selection, several output files are written. The file selected_frames.xyz contains the final structures. Other output files start with a number that indicates their order in the list of selections. Files ending with -info store basic information about the selected structures. For example,

#  index    confid      step    natoms           ene          aene        maxfrc         score
       0        -1         0        43     -196.7322       -4.5752       16.1094        0.0994
       1        -1       100        43     -242.8428       -5.6475       48.5978        0.0203
      -1      8700        43     -271.3238       -6.3099       30.7878        0.0164
      -1      8800        43     -271.2264       -6.3076       47.6111        0.0175
      -1      8900        43     -284.6631       -6.6201       64.4184        0.0143
      -1      9000        43     -303.0153       -7.0469       60.4111        0.0147
      -1      9100        43     -311.1232       -7.2354       66.2150        0.0120
      -1      9200        43     -309.4916       -7.1975       60.4200        0.0091
      -1      9300        43     -312.9583       -7.2781      149.9330        0.0097
      -1      9400        43     -314.2778       -7.3088       29.8337        0.0089
      -1      9500        43     -315.8645       -7.3457       34.8396        0.0114
      -1      9600        43     -310.2994       -7.2163       24.1396        0.0073
      -1      9700        43     -313.9329       -7.3008       29.1520        0.0062
      -1      9800        43     -327.4579       -7.6153       14.7447        0.0074
      -1      9900        43     -330.4879       -7.6858       20.1336        0.0083
      -1     10000        43     -329.5945       -7.6650       34.6097        0.0055
# random_seed None

The first columns are structure identifiers that come from explorations, for instance, the candidate ID (confid) and the dynamics step (step) in MD or minimisation. The other columns are natoms - number of atoms, ene - total energy, aene - average atomic energy (ene/natoms), maxfrc - maximum atomic force, score - selection score whose meaning depends on the sparsification method. Units are eV and eV/Ang.
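For post-processing, an -info file can be read with plain Python. The sketch below is not part of gdp; the helper read_info is hypothetical and assumes whitespace-separated columns with a commented header line, as in the example above. It parses the first two rows of that example and checks that aene equals ene/natoms.

```python
SAMPLE = """\
#  index    confid      step    natoms           ene          aene        maxfrc         score
       0        -1         0        43     -196.7322       -4.5752       16.1094        0.0994
       1        -1       100        43     -242.8428       -5.6475       48.5978        0.0203
# random_seed None
"""


def read_info(text):
    """Parse an -info table into a list of dicts keyed by column name."""
    names, rows = None, []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            fields = stripped.lstrip("#").split()
            # The first comment line starting with "index" is taken as the header.
            if names is None and fields and fields[0] == "index":
                names = fields
            continue
        values = stripped.split()
        if names is None or len(values) != len(names):
            raise ValueError(f"unexpected row: {line!r}")
        rows.append(dict(zip(names, values)))
    return rows


rows = read_info(SAMPLE)
for row in rows:
    ene, natoms, aene = float(row["ene"]), int(row["natoms"]), float(row["aene"])
    # aene is the total energy divided by the number of atoms.
    assert abs(ene / natoms - aene) < 1e-4
```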

Specific selection methods produce additional output files. Find details in the following subsections.

Warning

When the same selection is run again, gdp reads the cached results (-info.txt files). However, it does not check whether the input structures differ from those of the previous run. Remove the output files before running the selection if necessary.
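One way to clear the cache is a small Python helper such as the sketch below. It is not a gdp command; the function clear_cached_info is hypothetical, and it assumes that the cached -info.txt files sit directly in the output directory. Whether other output files should also be removed depends on the selection.

```python
from pathlib import Path


def clear_cached_info(directory):
    """Delete cached *-info.txt files in directory and return their names."""
    removed = []
    for path in sorted(Path(directory).glob("*-info.txt")):
        path.unlink()
        removed.append(path.name)
    return removed


# Usage, assuming results were written with `gdp -d results ...`:
#   clear_cached_info("results")
```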

List of Selectors

Descriptor

Property

Select structures based on properties. The property can be the total energy, atomic forces, or any property stored in the ASE atoms.info dictionary. The example below demonstrates a selection based on max_devi_f, the maximum deviation of the force predictions from a committee of MLIPs.

After choosing the property, there are several sparsification methods to select structures.

filter:

Select structures whose property lies within the given range. All valid structures are selected, regardless of the parameter number.
sort:

Sort structures by the property and select the first number of them. Set reverse: True if structures with larger property values are of interest.
hist:

Randomly select number structures with probabilities derived from the property histogram. For example, if 10 structures are to be selected and the dataset has 100 structures in bin 1 and 25 in bin 2, then roughly 8 will come from bin 1 and 2 from bin 2.
boltz:

Randomly select number structures with probabilities given by the Boltzmann distribution. This is useful when selecting structures based on energy-related properties. The probability is computed as exp(-p/kBT), where p is the property value and kBT is a user-defined parameter in eV.
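The two random methods can be illustrated with a short Python sketch. It reproduces the arithmetic of the hist example and the exp(-p/kBT) weighting of boltz, but it is not the gdp implementation: all function names are hypothetical, the property values and kBT = 0.2 eV are made up for illustration, and drawing without replacement is an assumption.

```python
import math
import random


def expected_per_bin(bin_counts, number):
    """Expected picks per bin when each bin is drawn in proportion to its population."""
    total = sum(bin_counts)
    return [number * count / total for count in bin_counts]


def boltzmann_probabilities(values, kBT):
    """Normalised probabilities proportional to exp(-p/kBT)."""
    # Shifting by the minimum avoids overflow and leaves the normalised result unchanged.
    shift = min(values)
    weights = [math.exp(-(v - shift) / kBT) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_without_replacement(probabilities, number, rng):
    """Draw up to `number` distinct indices according to the given probabilities."""
    remaining = list(range(len(probabilities)))
    weights = list(probabilities)
    chosen = []
    for _ in range(min(number, len(remaining))):
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return chosen


# hist example from the text: 100 structures in bin 1, 25 in bin 2, 10 to select.
print(expected_per_bin([100, 25], 10))  # [8.0, 2.0]

# boltz example with hypothetical property values (eV) and kBT = 0.2 eV.
values = [-4.58, -5.65, -7.67]
probabilities = boltzmann_probabilities(values, kBT=0.2)
selected = sample_without_replacement(probabilities, 2, random.Random(0))
print(probabilities, selected)
```

With the Boltzmann weighting, structures with lower property values receive higher probabilities, and a smaller kBT makes the selection more strongly biased towards them.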

selection:
  - method: property
    properties:
      max_devi_f:
        range: [0.05, null]
        nbins: 20
        sparsify: filter
  - method: property
    properties:
      max_devi_f:
        range: [0.05, 0.25]
        nbins: 20
        sparsify: hist
    number: [256, 1.0]

The first selection on the property max_devi_f with filter produces the output file below

#Property max_devi_f
# min 0.0304       max 17.9258
# avg 0.7199       std 0.4960
# histogram of 4914 points in the range (npoints: 5005)
    0.0500          3344
    0.9438          1547
    1.8376            11
    2.7314             2
    3.6252             4
    4.5189             3
    5.4127             1
    6.3065             0
    7.2003             0
    8.0941             0
    8.9879             0
    9.8817             0
   10.7755             0
   11.6693             0
   12.5631             0
   13.4568             1
   14.3506             0
   15.2444             0
   16.1382             0
   17.0320             1

Of the 5005 structures, 4914 have max_devi_f within [0.05, inf]. The remaining 91 structures have max_devi_f smaller than 0.05.
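The range check behind filter can be sketched as follows. This is an illustration rather than the gdp code: the helper in_range is hypothetical, the property values are made up, and treating both bounds as inclusive is an assumption. A bound of None plays the role of null in range: [0.05, null], meaning unbounded on that side.

```python
def in_range(value, lower=None, upper=None):
    """Return True if value lies within [lower, upper]; None means unbounded."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True


# Hypothetical max_devi_f values, filtered with range: [0.05, null].
values = [0.03, 0.05, 0.72, 17.93]
kept = [v for v in values if in_range(v, lower=0.05, upper=None)]
rejected = [v for v in values if not in_range(v, lower=0.05, upper=None)]
print(len(kept), len(rejected))  # 3 1
```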

Graph

…