# Input Generation


To follow this tutorial, please download and unzip `Example1/` from [this link](https://drive.google.com/file/d/1xBj3iP4eUInB2OKxCHstel4CTyXUDU9t/view?usp=drivesdk) (177 MB).

----

`VacHopPy` uses the **HDF5 format** as its primary input for analysis. This format allows for highly efficient, streaming-based data access. As a result, `VacHopPy` can process massive trajectory datasets (hundreds of gigabytes) quickly while consuming minimal RAM, typically only a few gigabytes.

Before starting an analysis, you must convert your MD trajectory into this HDF5 format. While this can be done using the Python API (see `parse_md()` and `parse_lammps()`), the most convenient method is the command-line interface (CLI). `VacHopPy` employs the Atomic Simulation Environment (**ASE**) to parse position and force data from a wide variety of MD formats. For a full list of compatible file formats, please see the [official ASE documentation](https://ase-lib.org/ase/io/io.html).

```{warning}
Your MD trajectory file **must** contain both **position** and **force** data for each atom. `VacHopPy` uses force information to accurately determine site occupations.
```

```{note}
Based on our experience, ASE can have issues parsing the `lammps-dump-text` format. To ensure robust handling of these files, `VacHopPy` automatically uses **MDAnalysis** as a backend for this specific format.
```

----

## How to Convert Trajectories to HDF5

To convert your MD trajectory into the HDF5 format, `VacHopPy` provides the `convert` command.

```bash
vachoppy convert [PATH_TRAJ] [TEMPERATURE] [TIMESTEP] --label [LABEL]
```

This command takes the following primary arguments:

* **`PATH_TRAJ`**: The path to your source MD trajectory file (e.g., `vasprun.xml`).

* **`TEMPERATURE`**: The simulation temperature in Kelvin.

* **`TIMESTEP`**: The time step between frames in femtoseconds (fs).

* **`--label [LABEL]`** (Optional): A suffix to append to the output HDF5 filenames.

For a full list of all available options, use the `-h` flag:

```bash
vachoppy convert -h
```

---

## Converting VASP Results into HDF5

Navigate into the example folder you downloaded (`path/to/Example1/`). The directory contains two subfolders: `VASP/` and `LAMMPS/`. Each subfolder contains MD outputs produced with the corresponding code. First, enter the `VASP/` directory:

```bash
cd path/to/Example1/VASP
ls
# >> OUTCAR_RUN01  OUTCAR_RUN02
```

The `VASP/` folder contains two files: `OUTCAR_RUN01` and `OUTCAR_RUN02`. These are two consecutive MD runs. It is common to continue a simulation after the initial run if hopping events were not sufficiently sampled — the two files here represent such a continuation. The example system is **rutile TiO₂** containing **two oxygen vacancies**.

Run the following vachoppy convert commands to convert each OUTCAR run into HDF5 input:

```bash
vachoppy convert OUTCAR_RUN01 2100 2 --label 01
# produces: TRAJ_Ti_01.h5, TRAJ_O_01.h5

vachoppy convert OUTCAR_RUN02 2100 2 --label 02
# produces: TRAJ_Ti_02.h5, TRAJ_O_02.h5
```

Each command creates two HDF5 files (one per atomic species) because `VacHopPy` stores trajectories split by atomic species to save disk space and enable efficient streaming access. For this tutorial, we are interested only in oxygen vacancies, so from now on we will use the `TRAJ_O_*.h5` files.

```{note}
The explanations in this section apply equally to files in formats other than `lammps-dump-text` (e.g., extxyz).
```

-----

## Concatenating Two HDF5 Files

The generated `TRAJ_O_01.h5` and `TRAJ_O_02.h5` files contain two consecutive segments of the same MD trajectory. Combining them into a single HDF5 file simplifies file management and subsequent analyses. `VacHopPy` provides the `concat` command to join two consecutive HDF5 files.

```bash
vachoppy concat TRAJ_O_01.h5 TRAJ_O_02.h5
# produces: TRAJ_O_CONCAT.h5
```

This command writes a single output file (by default `TRAJ_O_CONCAT.h5`) containing frames from the first file followed by frames from the second file. When concatenating, `VacHopPy` takes periodic boundary conditions (PBC) into account and applies a positional offset to the second file.

----

## Converting LAMMPS Results into HDF5

Enter the `LAMMPS/` directory:

```bash
cd path/to/Example1/LAMMPS
ls
# >> lammps.data  lammps.dump
```

The directory contains two files: `lammps.data`, which defines the initial atomic structure, and `lammps.dump`, which stores the MD trajectory. Because LAMMPS outputs can vary widely depending on user settings, converting its trajectories requires explicitly specifying the file format and atom styles.

To convert the trajectory into HDF5 format, run:

```bash
vachoppy convert lammps.dump 2100 2 \
  --label 01 \
  --format lammps-dump-text \
  --lammps_data lammps.data \
  --atom_style_data 'id type x y z' \
  --atom_style_dump 'id type x y z fx fy fz' \
  --atom_symbols 1=Ti 2=O
# produces: TRAJ_Ti_01.h5  TRAJ_O_01.h5
```

This command includes several additional arguments compared to the VASP example:

* **`--format`**: Specifies the input file format (`lammps-dump-text`).

* **`--lammps_data`**: Path to the LAMMPS data file defining the initial structure.

* **`--atom_style_data`**: Atom style used in the data file (e.g., 'id type x y z').

* **`--atom_style_dump`**: Atom style used in the dump file (e.g., 'id type x y z fx fy fz').

* **`--atom_symbols`**: Mapping between atom types and chemical symbols (e.g., 1=Ti 2=O).

This command creates two HDF5 files, `TRAJ_Ti_01.h5` and `TRAJ_O_01.h5`, just like in the VASP example. For more details on parsing LAMMPS trajectories and supported atom styles, refer to the [official MDAnalysis documentation](https://www.mdanalysis.org).

----

## How to Read HDF5 Files

You can inspect the contents of an HDF5 trajectory using the `show` command:

```bash
vachoppy show TRAJ_O_01.h5
```

This command prints key information about the stored MD simulation.

```{code-block} bash
:class: scrollable-output

==================================================
  Trajectory File: TRAJ_O_01.h5
==================================================

[Simulation Parameters]
  - Atomic Symbol:      O
  - Number of Frames:   10000
  - Temperature:        2100.0 K
  - Time Step:          2.0 fs

[Composition]
  - Counts:             O: 46, Ti: 24
  - Total Atoms:        70

[Lattice Vectors (Å)]
  [  9.16177,   0.00000,   0.00000]
  [  0.00000,   9.16177,   0.00000]
  [  0.00000,   0.00000,   8.92242]

[Stored Datasets]
  - positions:          Shape = (10000, 46, 3)
  - forces:             Shape = (10000, 46, 3)
==================================================
```


If you want to directly access the stored data for your own analysis, you can use the following Python script:

```python
import h5py
import json
import numpy as np

path_traj = 'TRAJ_O_01.h5'  # HDF5 file to read

with h5py.File(path_traj, 'r') as f:
    conditions = json.loads(f.attrs['metadata'])
    positions = np.array(f['positions'][:], dtype=np.float64)
    forces = np.array(f['forces'][:], dtype=np.float64)

print('=' * 40)
print('  Metadata')
print('=' * 40)
print(f"    Chemical Symbol   : {conditions['symbol']}")
print(f"    Number of Frames  : {conditions['nsw']}")
print(f"    Temperature (K)   : {conditions['temperature']}")
print(f"    Timestep (fs)     : {conditions['dt']}")
print('=' * 40)
print('  Main Data')
print('=' * 40)
print(f"    Shape of Positions: {positions.shape}")
print(f"    Shape of Forces   : {forces.shape}")
print('=' * 40)
```

Example output:

```bash
========================================
  Metadata
========================================
    Chemical Symbol   : O
    Number of Frames  : 10000
    Temperature (K)   : 2100.0
    Timestep (fs)     : 2.0
========================================
  Main Data
========================================
    Shape of Positions: (10000, 46, 3)
    Shape of Forces   : (10000, 46, 3)
========================================
```

The datasets `positions` and `forces` are stored in the shape `(n_frames, n_atoms, 3)`. Note that the `positions` contains **PBC-unwrapped fractional coordinates**. It can be converted to Cartesian coordiantes using lattice parameters, which can be obtained from `conditions['lattice']`.