I could be missing something, but is there some sort of metric behind these comparisons to other software? Something like the BLEU score, which I've seen used in studies comparing LLMs to Google Translate. I find it difficult to believe it is better than DeepL in a vacuum.