# Input Generation

To follow this tutorial, please download and unzip `Example1/` from [this link](https://drive.google.com/file/d/1xBj3iP4eUInB2OKxCHstel4CTyXUDU9t/view?usp=drivesdk) (177 MB).

----

`VacHopPy` uses the **HDF5 format** as its primary input for analysis. This format allows for highly efficient, streaming-based data access. As a result, `VacHopPy` can process massive trajectory datasets (hundreds of gigabytes) quickly while consuming minimal RAM, typically only a few gigabytes.

Before starting an analysis, you must convert your MD trajectory into this HDF5 format. While this can be done using the Python API (see `parse_md()` and `parse_lammps()`), the most convenient method is the command-line interface (CLI).

`VacHopPy` employs the Atomic Simulation Environment (**ASE**) to parse position and force data from a wide variety of MD formats. For a full list of compatible file formats, please see the [official ASE documentation](https://ase-lib.org/ase/io/io.html).

```{warning}
Your MD trajectory file **must** contain both **position** and **force** data for each atom. `VacHopPy` uses force information to accurately determine site occupations.
```

```{note}
Based on our experience, ASE can have issues parsing the `lammps-dump-text` format. To ensure robust handling of these files, `VacHopPy` automatically uses **MDAnalysis** as a backend for this specific format.
```

----

## How to Convert Trajectories to HDF5

To convert your MD trajectory into the HDF5 format, `VacHopPy` provides the `convert` command.

```bash
vachoppy convert [PATH_TRAJ] [TEMPERATURE] [TIMESTEP] --label [LABEL]
```

This command takes the following primary arguments:

* **`PATH_TRAJ`**: The path to your source MD trajectory file (e.g., `vasprun.xml`).
* **`TEMPERATURE`**: The simulation temperature in Kelvin.
* **`TIMESTEP`**: The time step between frames in femtoseconds (fs).
* **`--label [LABEL]`** (Optional): A suffix to append to the output HDF5 filenames.

For a full list of all available options, use the `-h` flag:

```bash
vachoppy convert -h
```

---

## Converting VASP Results into HDF5

Navigate into the example folder you downloaded (`path/to/Example1/`). The directory contains two subfolders: `VASP/` and `LAMMPS/`. Each subfolder contains MD outputs produced with the corresponding code.

First, enter the `VASP/` directory:

```bash
cd path/to/Example1/VASP
ls
# >> OUTCAR_RUN01 OUTCAR_RUN02
```

The `VASP/` folder contains two files: `OUTCAR_RUN01` and `OUTCAR_RUN02`. These are two consecutive MD runs. It is common to continue a simulation after the initial run if hopping events were not sufficiently sampled; the two files here represent such a continuation.

The example system is **rutile TiO₂** containing **two oxygen vacancies**.

Run the following `vachoppy convert` commands to convert each OUTCAR run into HDF5 input:

```bash
vachoppy convert OUTCAR_RUN01 2100 2 --label 01
# produces: TRAJ_Ti_01.h5, TRAJ_O_01.h5

vachoppy convert OUTCAR_RUN02 2100 2 --label 02
# produces: TRAJ_Ti_02.h5, TRAJ_O_02.h5
```

Each command creates two HDF5 files (one per atomic species) because `VacHopPy` stores trajectories split by atomic species to save disk space and enable efficient streaming access.

For this tutorial, we are interested only in oxygen vacancies, so from now on we will use the `TRAJ_O_*.h5` files.

```{note}
The explanations in this section apply equally to files in formats other than `lammps-dump-text` (e.g., extxyz).
```

-----

## Concatenating Two HDF5 Files

The generated `TRAJ_O_01.h5` and `TRAJ_O_02.h5` files contain two consecutive segments of the same MD trajectory. Combining them into a single HDF5 file simplifies file management and subsequent analyses.

`VacHopPy` provides the `concat` command to join two consecutive HDF5 files.

```bash
vachoppy concat TRAJ_O_01.h5 TRAJ_O_02.h5
# produces: TRAJ_O_CONCAT.h5
```

This command writes a single output file (by default `TRAJ_O_CONCAT.h5`) containing frames from the first file followed by frames from the second file. When concatenating, `VacHopPy` takes periodic boundary conditions (PBC) into account and applies a positional offset to the second file.

----

## Converting LAMMPS Results into HDF5

Enter the `LAMMPS/` directory:

```bash
cd path/to/Example1/LAMMPS
ls
# >> lammps.data lammps.dump
```

The directory contains two files: `lammps.data`, which defines the initial atomic structure, and `lammps.dump`, which stores the MD trajectory.

Because LAMMPS outputs can vary widely depending on user settings, converting its trajectories requires explicitly specifying the file format and atom styles. To convert the trajectory into HDF5 format, run:

```bash
vachoppy convert lammps.dump 2100 2 \
    --label 01 \
    --format lammps-dump-text \
    --lammps_data lammps.data \
    --atom_style_data 'id type x y z' \
    --atom_style_dump 'id type x y z fx fy fz' \
    --atom_symbols 1=Ti 2=O
# produces: TRAJ_Ti_01.h5 TRAJ_O_01.h5
```

This command includes several additional arguments compared to the VASP example:

* **`--format`**: Specifies the input file format (`lammps-dump-text`).
* **`--lammps_data`**: Path to the LAMMPS data file defining the initial structure.
* **`--atom_style_data`**: Atom style used in the data file (e.g., 'id type x y z').
* **`--atom_style_dump`**: Atom style used in the dump file (e.g., 'id type x y z fx fy fz').
* **`--atom_symbols`**: Mapping between atom types and chemical symbols (e.g., 1=Ti 2=O).

This command creates two HDF5 files, `TRAJ_Ti_01.h5` and `TRAJ_O_01.h5`, just like in the VASP example.

For more details on parsing LAMMPS trajectories and supported atom styles, refer to the [official MDAnalysis documentation](https://www.mdanalysis.org).

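Before running the conversion, it can help to confirm which per-atom columns a dump file actually contains, since the warning above requires force data. The sketch below is not part of `VacHopPy`; `read_dump_columns` is a hypothetical helper, and the sample dump text is made up for illustration. It relies only on the standard LAMMPS text dump layout, in which the `ITEM: ATOMS` header line lists the column names. It also assumes that the `--atom_style_dump` string should match that column list, as the example command suggests.

```python
import io

# Made-up two-atom dump frame, used only so the example is self-contained.
SAMPLE_DUMP = """ITEM: TIMESTEP
0
ITEM: NUMBER OF ATOMS
2
ITEM: BOX BOUNDS pp pp pp
0.0 9.16177
0.0 9.16177
0.0 8.92242
ITEM: ATOMS id type x y z fx fy fz
1 1 0.0 0.0 0.0 0.01 0.00 -0.02
2 2 1.8 1.8 0.0 -0.01 0.00 0.02
"""


def read_dump_columns(handle):
    """Return the column names from the first 'ITEM: ATOMS' header line.

    Hypothetical helper (not a VacHopPy function).
    """
    prefix = "ITEM: ATOMS"
    for line in handle:
        if line.startswith(prefix):
            return line[len(prefix):].split()
    raise ValueError("No 'ITEM: ATOMS' header found; is this a text dump?")


columns = read_dump_columns(io.StringIO(SAMPLE_DUMP))
atom_style_dump = " ".join(columns)           # candidate --atom_style_dump value
has_forces = {"fx", "fy", "fz"} <= set(columns)

print(atom_style_dump)  # id type x y z fx fy fz
print(has_forces)       # True
```

For a real file, replace `io.StringIO(SAMPLE_DUMP)` with `open('lammps.dump')`. Because the function stops at the first header, it reads only a few lines even for very large dumps.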
----

## How to Read HDF5 Files

You can inspect the contents of an HDF5 trajectory using the `show` command:

```bash
vachoppy show TRAJ_O_01.h5
```

This command prints key information about the stored MD simulation.

```{code-block} bash
:class: scrollable-output

==================================================
  Trajectory File: TRAJ_O_01.h5
==================================================

[Simulation Parameters]
  - Atomic Symbol:      O
  - Number of Frames:   10000
  - Temperature:        2100.0 K
  - Time Step:          2.0 fs

[Composition]
  - Counts:             O: 46, Ti: 24
  - Total Atoms:        70

[Lattice Vectors (Å)]
  [ 9.16177,  0.00000,  0.00000]
  [ 0.00000,  9.16177,  0.00000]
  [ 0.00000,  0.00000,  8.92242]

[Stored Datasets]
  - positions:          Shape = (10000, 46, 3)
  - forces:             Shape = (10000, 46, 3)
==================================================
```

If you want to directly access the stored data for your own analysis, you can use the following Python script:

```python
import h5py
import json
import numpy as np

path_traj = 'TRAJ_O_01.h5'  # HDF5 file to read

with h5py.File(path_traj, 'r') as f:
    # Simulation metadata is stored as a JSON string in the file attributes
    conditions = json.loads(f.attrs['metadata'])
    # [:] loads the full datasets into memory
    positions = np.array(f['positions'][:], dtype=np.float64)
    forces = np.array(f['forces'][:], dtype=np.float64)

print('=' * 40)
print(' Metadata')
print('=' * 40)
print(f" Chemical Symbol   : {conditions['symbol']}")
print(f" Number of Frames  : {conditions['nsw']}")
print(f" Temperature (K)   : {conditions['temperature']}")
print(f" Timestep (fs)     : {conditions['dt']}")
print('=' * 40)
print(' Main Data')
print('=' * 40)
print(f" Shape of Positions: {positions.shape}")
print(f" Shape of Forces   : {forces.shape}")
print('=' * 40)
```

Example output:

```bash
========================================
 Metadata
========================================
 Chemical Symbol   : O
 Number of Frames  : 10000
 Temperature (K)   : 2100.0
 Timestep (fs)     : 2.0
========================================
 Main Data
========================================
 Shape of Positions: (10000, 46, 3)
 Shape of Forces   : (10000, 46, 3)
========================================
```

The `positions` and `forces` datasets are stored with the shape `(n_frames, n_atoms, 3)`. Note that `positions` contains **PBC-unwrapped fractional coordinates**. They can be converted to Cartesian coordinates using the lattice parameters, which can be obtained from `conditions['lattice']`.
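As a minimal sketch of that conversion, the example below multiplies fractional coordinates by a lattice matrix. It assumes that `conditions['lattice']` holds a 3×3 array whose rows are the lattice vectors in Å, matching the layout printed by `vachoppy show`; this layout is an assumption and should be checked against your own file. The function name `frac_to_cart` and the small stand-in arrays are illustrative only, so the snippet runs without an HDF5 file.

```python
import numpy as np

# Stand-in for np.array(conditions['lattice']).
# ASSUMPTION: rows are the lattice vectors (Å), as printed by `vachoppy show`.
lattice = np.array([
    [9.16177, 0.00000, 0.00000],
    [0.00000, 9.16177, 0.00000],
    [0.00000, 0.00000, 8.92242],
])

# Stand-in for `positions`: 2 frames, 2 atoms, unwrapped fractional coordinates.
# Values outside [0, 1) are allowed because the coordinates are unwrapped.
frac = np.array([
    [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]],
    [[0.1, 0.0, 0.0], [1.5, 0.5, 0.5]],
])


def frac_to_cart(frac_positions, lattice_rows):
    """Convert (n_frames, n_atoms, 3) fractional coordinates to Cartesian (Å).

    With lattice vectors stored as rows, r_cart = r_frac @ lattice.
    """
    return np.asarray(frac_positions) @ np.asarray(lattice_rows)


cart = frac_to_cart(frac, lattice)

# Displacement of each atom between the first and last frame, in Å.
# Unwrapped coordinates make this a plain subtraction, with no PBC correction.
displacement = cart[-1] - cart[0]

print(cart.shape)       # (2, 2, 3)
print(displacement[1])  # second atom moved one full lattice vector along x
```

With the real data, pass `positions` and `np.array(conditions['lattice'])` in place of the stand-in arrays.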