# Data Preparation
We suggest using the data parsing tool `dftio` to convert the output of DFT calculations directly into readable datasets. Our implementation supports the parsed dataset format of `dftio`. Users can simply clone the `dftio` repository and run `pip install .` in its root directory. Then the following command can be used to parse data, in parallel, directly from the DFT output:
```
usage: dftio parse [-h] [-ll {DEBUG,3,INFO,2,WARNING,1,ERROR,0}] [-lp LOG_PATH] [-m MODE] [-n NUM_WORKERS] [-r ROOT] [-p PREFIX] [-o OUTROOT] [-f FORMAT] [-ham] [-ovp] [-dm] [-eig]

optional arguments:
  -h, --help            show this help message and exit
  -ll {DEBUG,3,INFO,2,WARNING,1,ERROR,0}, --log-level {DEBUG,3,INFO,2,WARNING,1,ERROR,0}
                        set verbosity level by string or number, 0=ERROR, 1=WARNING, 2=INFO and 3=DEBUG (default: INFO)
  -lp LOG_PATH, --log-path LOG_PATH
                        set log file to log messages to disk, if not specified, the logs will only be output to console (default: None)
  -m MODE, --mode MODE  The name of the DFT software. (default: abacus)
  -n NUM_WORKERS, --num_workers NUM_WORKERS
                        The number of workers used to parse the dataset. (For n>1, multiprocessing is used to accelerate IO.) (default: 1)
  -r ROOT, --root ROOT  The root directory of the DFT files. (default: ./)
  -p PREFIX, --prefix PREFIX
                        The prefix of the DFT files under root. (default: frame)
  -o OUTROOT, --outroot OUTROOT
                        The output root directory. (default: ./)
  -f FORMAT, --format FORMAT
                        The output data format. (default: dat)
  -ham, --hamiltonian   Whether to parse the Hamiltonian matrix. (default: False)
  -ovp, --overlap       Whether to parse the Overlap matrix. (default: False)
  -dm, --density_matrix
                        Whether to parse the Density matrix. (default: False)
  -eig, --eigenvalue    Whether to parse the k-points and eigenvalues. (default: False)
```
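For concreteness, the snippet below shows one way to drive the parser from Python via `subprocess`; all flags follow the help text above, while the directory names are assumptions made for this example only. The equivalent command can of course be typed directly in a shell.

```python
import subprocess

# Parse ABACUS outputs found under ./abacus_runs (hypothetical layout) into ./data,
# keeping the Hamiltonian and overlap matrices and using 4 workers.
# Equivalent shell command:
#   dftio parse -m abacus -r ./abacus_runs -p frame -o ./data -ham -ovp -n 4
subprocess.run(
    ["dftio", "parse",
     "-m", "abacus",          # DFT software that produced the output
     "-r", "./abacus_runs",   # root directory of the DFT files (assumed path)
     "-p", "frame",           # folder prefix under root
     "-o", "./data",          # output root for the parsed dataset
     "-ham", "-ovp",          # parse Hamiltonian and overlap matrices
     "-n", "4"],              # number of parallel workers
    check=True,
)
```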
After parsing, the user needs to write an `info.json` file and put it in the dataset folder. For the default dataset type, the `info.json` looks like:
```json
{
    "nframes": 1,
    "pos_type": "cart",
    "AtomicData_options": {
        "r_max": 7.0,
        "pbc": true
    }
}
```
Here `pos_type` can be `cart`, `dirc` or `ase`. For a `dftio` output dataset, we use `cart` by default. The `r_max`, in principle, should align with the orbital cutoff in the DFT calculation. For a single element, `r_max` should be a float number, indicating the largest bond distance included. When the system contains multiple atomic species, `r_max` can also be a dict of species-specific values such as `{A: 7.0, B: 8.0}`. Then the largest A-A bond would be 7, the largest A-B bond would be (7+8)/2 = 7.5, and the largest B-B bond would be 8. `pbc` can be a bool variable, indicating whether periodic boundary conditions are applied. It can also be a list of three bool elements such as `[true, true, false]`, which sets the periodicity of each direction independently.
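To illustrate the species-specific `r_max` and the direction-resolved `pbc` described above, the short Python sketch below writes such an `info.json` for a hypothetical Si/O slab dataset; the element names, cutoff values, frame count, and output path are assumptions made for the example only.

```python
import json

# Hypothetical two-species dataset: species-specific cutoffs and a slab-like
# system that is periodic in x and y but not in z.
info = {
    "nframes": 10,
    "pos_type": "cart",
    "AtomicData_options": {
        "r_max": {"Si": 7.0, "O": 8.0},   # Si-Si: 7.0, Si-O: (7+8)/2 = 7.5, O-O: 8.0
        "pbc": [True, True, False],       # periodicity set per direction
    },
}

# Place info.json in the root of the parsed dataset folder (path is an assumption).
with open("./data/info.json", "w") as f:
    json.dump(info, f, indent=4)
```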
For the LMDB-type dataset, the `info.json` is much simpler. It looks like this:
```json
{
    "r_max": 7.0
}
```
The other information is already stored in the dataset itself. The LMDB dataset is designed for handling very large data that cannot fit into memory directly.
Then you can set the `data_options` in the input parameters to point directly to the prepared dataset, like:
"data_options": {
"train": {
"root": "./data",
"prefix": "Si64",
"get_Hamiltonian": true,
"get_overlap": true
}
}
If you are using a Python script, the dataset can be built with the same parameters using `build_dataset`:
```python
from dptb.data import build_dataset

# Build the dataset from the parsed files under `root` whose folders start with `prefix`.
dataset = build_dataset(
    root="your dataset root",
    type="DefaultDataset",
    prefix="frame",
    get_overlap=True,
    get_Hamiltonian=True,
    basis={"Si": "2s2p1d"},
)
```
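As a quick sanity check, you may want to look at what was loaded. The snippet below is a minimal sketch that assumes the object returned by `build_dataset` supports `len()` and integer indexing, which is not stated in this section:

```python
# Assumed behaviour: the dataset acts like a sequence of parsed frames.
print("number of frames:", len(dataset))
first_frame = dataset[0]   # one structure together with its Hamiltonian/overlap data
print(first_frame)
```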