MACE model training tips

Bug fix

Prepare Data

mace_prepare_data does not support loading multiple files via a glob pattern, which makes it almost impossible to use for a dataset split across many files. This PR fixes the problem: ACEsuit/mace#1313

After applying the PR, the data can be prepared with the following command:

mkdir -p dataset
mace_prepare_data \
    --train_file="mptrj-gga-ggapu/*.extxyz" \
    --valid_fraction=0.05 \
    --r_max=4.5 \
    --h5_prefix="dataset/" \
    --compute_statistics \
    --E0s="average" \
    --energy_key="energy" \
    --forces_key="forces" \
    --seed=123
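
Before running the command, it can help to confirm which files the pattern actually matches. The following is a minimal sketch, assuming Python's glob module expands the pattern the same way the patched mace_prepare_data does (this has not been checked against the PR). preview_glob is a hypothetical helper and not part of MACE.

```python
import glob
import os


def preview_glob(pattern):
    """Return the sorted list of regular files matched by a shell-style glob."""
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


if __name__ == "__main__":
    # Same pattern as the one passed to --train_file above.
    files = preview_glob("mptrj-gga-ggapu/*.extxyz")
    print(f"{len(files)} file(s) matched")
    for path in files[:5]:
        print("  ", path)
```

If this reports zero files, the pattern or the working directory is likely wrong, and it is cheaper to find that out here than after launching the preprocessing job.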

Fix statistics.json

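The statistics.json generated by mace_prepare_data is broken: it contains NumPy scalar reprs such as np.int64(...) and np.float64(...) where plain numbers are expected. The following script strips those wrappers in place.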

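# fix.py
# Strip NumPy scalar wrappers from statistics.json in place,
# e.g. np.float64(1.5) -> 1.5 and np.int64(3) -> 3.
import re

def replace_np_types(content):
    # Group 2 captures the bare number inside np.int64(...) / np.float64(...).
    pattern = r'np\.(int64|float64)\(([-+]?\d+\.?\d*e?[-+]?\d*)\)'
    return re.sub(pattern, r'\2', content)

# Open for reading and writing so the file can be rewritten in place.
with open('./dataset/statistics.json', 'r+', encoding='utf-8') as f:
    content = f.read()
    cleaned_content = replace_np_types(content)
    f.seek(0)
    f.write(cleaned_content)
    f.truncate()  # the cleaned text is shorter, so drop the leftover tail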
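
To see what the substitution does, and to catch wrappers it does not cover, a small check like the one below may be useful. It is only a sketch: the sample string is made up for illustration and the real file contents will differ, and leftover_wrappers is a hypothetical helper that is not part of fix.py or MACE.

```python
import re

# Same pattern as in fix.py.
PATTERN = r'np\.(int64|float64)\(([-+]?\d+\.?\d*e?[-+]?\d*)\)'


def replace_np_types(content):
    """Replace np.int64(x) / np.float64(x) with the bare number x."""
    return re.sub(PATTERN, r'\2', content)


def leftover_wrappers(content):
    """Return any np.<type>(...) fragments that are still present."""
    return re.findall(r'np\.\w+\([^)]*\)', content)


# Hypothetical fragment; real keys and values will differ.
sample = "{1: np.float64(-3.667), 8: np.float64(1.5e-05), 26: np.int64(42), 3: np.float64(nan)}"

cleaned = replace_np_types(sample)
print(cleaned)
print(leftover_wrappers(cleaned))
```

The pattern only matches plain decimal and exponent notation, so a non-finite value such as np.float64(nan) is left untouched and shows up in the leftover list. An empty list after running fix.py suggests every wrapper was rewritten.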

Run training

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
MACE_SCRIPT=/root/ikkem-test/mace-env/lib/python3.10/site-packages/mace/cli/run_train.py
torchrun --nproc_per_node=8 $MACE_SCRIPT \
    --name=Mace-model \
    --distributed \
    --num_workers=8 \
    --device=cuda \
    --launcher=torchrun \
    --train_file="/root/ikkem-test/dataset/train" \
    --valid_file="/root/ikkem-test/dataset/val" \
    --statistics_file="/root/ikkem-test/dataset/statistics.json" \
    --loss='universal' \
    --energy_weight=1 \
    --forces_weight=10 \
    --compute_stress=True \
    --stress_weight=100 \
    --stress_key='stress' \
    --eval_interval=1 \
    --error_table='PerAtomMAE' \
    --model="ScaleShiftMACE" \
    --interaction_first="RealAgnosticResidualInteractionBlock" \
    --interaction="RealAgnosticResidualInteractionBlock" \
    --num_interactions=2 \
    --correlation=3 \
    --max_ell=3 \
    --r_max=6.0 \
    --max_L=0 \
    --num_channels=128 \
    --num_radial_basis=10 \
    --MLP_irreps="16x0e" \
    --scaling='rms_forces_scaling' \
    --lr=0.005 \
    --weight_decay=1e-8 \
    --ema \
    --ema_decay=0.995 \
    --scheduler_patience=5 \
    --batch_size=16 \
    --valid_batch_size=32 \
    --max_num_epochs=200 \
    --patience=50 \
    --amsgrad \
    --seed=1 \
    --clip_grad=100 \
    --keep_checkpoints \
    --save_cpu \
    --energy_key="energy" \
    --forces_key="forces"
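
The MACE_SCRIPT path above is specific to one machine and Python version. One possible way to avoid hard-coding it is to ask Python where the module lives. This is a sketch only, assuming mace is installed in the active environment and that the training entry point is the mace.cli.run_train module, as the path above suggests. module_path is a hypothetical helper name.

```python
import importlib.util


def module_path(module_name):
    """Return the source file of an importable module, or raise if not found."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"cannot locate {module_name!r}")
    return spec.origin


if __name__ == "__main__":
    try:
        # Assumption: the training script is the mace.cli.run_train module.
        print(module_path("mace.cli.run_train"))
    except ModuleNotFoundError as exc:
        print(f"mace does not appear to be installed: {exc}")
```

Saved under a name of your choice (for example resolve_mace.py, a hypothetical file name), its output can be assigned to the variable with MACE_SCRIPT=$(python resolve_mace.py) before calling torchrun. Note that find_spec imports the parent package, so this takes about as long as importing mace itself.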

link89/mace-training.md