Selections
==========

This section describes how to run basic selections with different selectors
using a unified input file.

The related commands are:

.. code-block:: shell

    # gdp -h for more info
    $ gdp -h

    # - results are written to the current directory
    $ gdp select ./selection.yaml -s structures.xyz

    # - if the -d option is given, results are written to that directory
    $ gdp -d results select ./selection.yaml -s structures.xyz

    # - accelerate the selection using 8 parallel processes,
    #   which is useful for descriptor-based selection as it requires
    #   massive computations
    $ gdp -nj 8 -d results select ./selection.yaml -s structures.xyz

The selection configuration (`./selection.yaml`) is organised as follows:

.. code-block:: yaml

    selection:
      - method: property
        ... # define property and sparsification
        number: [512, 0.5]
      - method: descriptor
        ... # define descriptor and sparsification
        number: [128, 1.0]

This `selection` defines a sequential procedure that consists of two selectors.
Input structures are first selected based on the property, and the resulting
structures are then further selected based on the descriptor.

For most selectors, the parameter `number` is required because it determines the number of
selected structures. The first value is a fixed number and the second value is a fraction of
the input structures. The fraction is used when the input contains fewer structures than the
fixed number. For example, if the input dataset has 500 structures, `number: [512, 0.5]`
selects 250 structures (500*0.5=250, as 500 < 512). The next selector, with
`number: [128, 1.0]`, then selects 128 structures (as 250 > 128).
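
The rule described above can be sketched in a few lines of Python. This only
illustrates the behaviour stated in this section and is not the actual `gdp`
implementation; the function name `resolve_number` and the truncation to an
integer are assumptions.

.. code-block:: python

    def resolve_number(n_input: int, number: tuple) -> int:
        """Return how many structures to select (hypothetical helper)."""
        fixed, fraction = number
        if n_input < fixed:
            # Fewer inputs than the fixed number: fall back to the fraction.
            return int(n_input * fraction)
        return fixed


    # Chain the two selectors of the example configuration.
    n_first = resolve_number(500, (512, 0.5))
    n_second = resolve_number(n_first, (128, 1.0))
    print(n_first, n_second)

Chaining the calls mirrors the sequential procedure: the output size of the
first selector is the input size of the second.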

After a successful selection, several output files are written. The file
`selected_frames.xyz` contains the final structures. Output files start with a number that
indicates their order in the list of selections. Files whose names end with `-info` store
basic information about the selected structures. For example,

.. code-block:: shell

    #  index    confid      step    natoms           ene          aene        maxfrc         score
           0        -1         0        43     -196.7322       -4.5752       16.1094        0.0994
           1        -1       100        43     -242.8428       -5.6475       48.5978        0.0203
          87        -1      8700        43     -271.3238       -6.3099       30.7878        0.0164
          88        -1      8800        43     -271.2264       -6.3076       47.6111        0.0175
          89        -1      8900        43     -284.6631       -6.6201       64.4184        0.0143
          90        -1      9000        43     -303.0153       -7.0469       60.4111        0.0147
          91        -1      9100        43     -311.1232       -7.2354       66.2150        0.0120
          92        -1      9200        43     -309.4916       -7.1975       60.4200        0.0091
          93        -1      9300        43     -312.9583       -7.2781      149.9330        0.0097
          94        -1      9400        43     -314.2778       -7.3088       29.8337        0.0089
          95        -1      9500        43     -315.8645       -7.3457       34.8396        0.0114
          96        -1      9600        43     -310.2994       -7.2163       24.1396        0.0073
          97        -1      9700        43     -313.9329       -7.3008       29.1520        0.0062
          98        -1      9800        43     -327.4579       -7.6153       14.7447        0.0074
          99        -1      9900        43     -330.4879       -7.6858       20.1336        0.0083
         100        -1     10000        43     -329.5945       -7.6650       34.6097        0.0055
    # random_seed None

The first columns are structure identifiers that come from explorations, for instance,
the candidate ID (`confid`) and the dynamics step (`step`) in MD or minimisation. The other
columns are `natoms` (number of atoms), `ene` (total energy), `aene` (average atomic energy,
ene/natoms), `maxfrc` (maximum atomic force), and `score` (selection score, whose meaning
depends on the sparsification method). Energies are in `eV` and forces in `eV/Ang`.
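
Since the listing is a whitespace-separated table in which the header and
footer lines start with `#`, it can be post-processed with a few lines of
Python. The following is a minimal sketch using only the standard library; the
function name `read_info` is hypothetical, the column names are read from the
first comment line, and the two sample rows are copied from the listing above.

.. code-block:: python

    INFO_TEXT = """\
    #  index    confid      step    natoms           ene          aene        maxfrc         score
           0        -1         0        43     -196.7322       -4.5752       16.1094        0.0994
           1        -1       100        43     -242.8428       -5.6475       48.5978        0.0203
    # random_seed None
    """


    def read_info(text):
        """Parse an info table into a list of dicts (all values as floats)."""
        columns, rows = None, []
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith("#"):
                # The first comment line holds the column names;
                # later comment lines (e.g. the random seed) are skipped.
                if columns is None:
                    columns = line.lstrip("#").split()
                continue
            values = [float(v) for v in line.split()]
            rows.append(dict(zip(columns, values)))
        return rows


    rows = read_info(INFO_TEXT)
    for row in rows:
        # aene should match ene / natoms up to the printed precision.
        print(int(row["index"]), round(row["ene"] / row["natoms"], 4), row["aene"])

Reading the column names from the header keeps the sketch usable when the
identifier columns differ between explorations.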

Specific selection methods write additional output files. See the following
subsections for details.

.. warning::

    When the same selection is run again, `gdp` reads the cached results (`-info.txt` files).
    However, it does not check whether the input structures differ from those of the previous
    run. Remove the output files before running the selection if necessary.

List of Selectors
-----------------

.. toctree::
    :maxdepth: 2

    descriptor.rst


Property
--------

`Select structures based on properties.` The property can be the total energy, atomic forces,
or any property that can be stored in the **ase** `atoms.info`. The example below demonstrates
a selection based on `max_devi_f`, the maximum deviation of the force prediction by
a committee of MLIPs.

After choosing the property, there are several sparsification methods to select structures.

- filter: 
  
    Select structures whose property lies within `range`. All such structures are
    selected, regardless of the parameter `number`.

- sort: 

    Sort structures by property and select the first `number` of them. Set `reverse: True` 
    if structures with larger property values are of interest.

- hist: 

    Randomly select `number` structures with probabilities given by the histogram.
    For example, if 10 structures are to be selected and the dataset has 100 structures
    in bin 1 and 25 in bin 2, then roughly 8 will come from bin 1 and 2 from bin 2.

- boltz: 
 
    Randomly select `number` structures with probabilities given by the Boltzmann
    distribution. This is useful when selecting structures based on energy-related
    properties. The probability is computed as `exp(-p/kBT)`, where `p` is the property
    value and `kBT` is a user-defined parameter in eV.
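
The four sparsification methods listed above can be illustrated with a short
Python sketch. It is not the `gdp` implementation: all function names are
hypothetical, the inclusive range bounds and the drawing without replacement
are assumptions, and the Boltzmann weights are normalised so that they sum to
one.

.. code-block:: python

    import math
    import random


    def filter_indices(values, lower=None, upper=None):
        """filter: keep every value inside [lower, upper]; None means unbounded."""
        return [
            i for i, v in enumerate(values)
            if (lower is None or v >= lower) and (upper is None or v <= upper)
        ]


    def sort_indices(values, number, reverse=False):
        """sort: indices of the first `number` values after sorting."""
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
        return order[:number]


    def hist_expected(bin_counts, number):
        """hist: expected number of picks per bin, proportional to bin population."""
        total = sum(bin_counts)
        return [number * count / total for count in bin_counts]


    def boltz_probabilities(values, kBT):
        """boltz: normalised weights exp(-p/kBT) for each property value p."""
        weights = [math.exp(-p / kBT) for p in values]
        total = sum(weights)
        return [w / total for w in weights]


    def weighted_sample(weights, number, rng):
        """Draw up to `number` distinct indices, each draw proportional to weight."""
        candidates = list(range(len(weights)))
        remaining = list(weights)
        chosen = []
        for _ in range(min(number, len(candidates))):
            k = rng.choices(range(len(candidates)), weights=remaining)[0]
            chosen.append(candidates.pop(k))
            remaining.pop(k)
        return chosen


    values = [0.02, 0.30, 0.07, 0.12]
    print(filter_indices(values, lower=0.05))      # range: [0.05, null]
    print(sort_indices(values, 2, reverse=True))   # two largest values
    print(hist_expected([100, 25], 10))            # the bin example from the text
    probabilities = boltz_probabilities([0.0, 0.1, 0.2], kBT=0.1)
    print(weighted_sample(probabilities, 2, random.Random(0)))

In the `boltz` case a smaller `kBT` concentrates the probability on the
structures with the lowest property values, while a larger `kBT` approaches
uniform sampling.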

.. code-block:: yaml
    :emphasize-lines: 7, 13

    selection:
      - method: property
        properties:
          max_devi_f:
            range: [0.05, null]
            nbins: 20
            sparsify: filter
      - method: property
        properties:
          max_devi_f:
            range: [0.05, 0.25]
            nbins: 20
            sparsify: hist
        number: [256, 1.0]


The first selection on the property `max_devi_f` with `filter` produces the output file
below:

.. code-block:: yaml

    #Property max_devi_f
    # min 0.0304       max 17.9258
    # avg 0.7199       std 0.4960
    # histogram of 4914 points in the range (npoints: 5005)
          0.0500          3344
          0.9438          1547
          1.8376            11
          2.7314             2
          3.6252             4
          4.5189             3
          5.4127             1
          6.3065             0
          7.2003             0
          8.0941             0
          8.9879             0
          9.8817             0
         10.7755             0
         11.6693             0
         12.5631             0
         13.4568             1
         14.3506             0
         15.2444             0
         16.1382             0
         17.0320             1

4914 of the 5005 structures have a `max_devi_f` within [0.05, inf]. The remaining 91
structures have a `max_devi_f` smaller than 0.05.
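
The first column of the histogram appears to be the left edge of each bin: the
listed values are consistent with `nbins: 20` equal-width bins between the lower
bound 0.05 and the observed maximum, which replaces the open upper bound. The
sketch below reproduces the edges from the rounded numbers in the file header,
so agreement is only expected to about three decimal places; the variable
names are illustrative.

.. code-block:: python

    lower, upper, nbins = 0.05, 17.9258, 20
    width = (upper - lower) / nbins
    left_edges = [lower + i * width for i in range(nbins)]

    # Bin populations copied from the output file above.
    counts = [3344, 1547, 11, 2, 4, 3, 1, 0, 0, 0,
              0, 0, 0, 0, 0, 1, 0, 0, 0, 1]

    for edge, count in zip(left_edges, counts):
        print(f"{edge:12.4f} {count:12d}")

    # Structures inside the range, and structures below the lower bound.
    n_total = 5005
    print(sum(counts), n_total - sum(counts))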
        
Graph
-----

...