Text validation framework for Python - test AI outputs against BLEU, ROUGE, semantic similarity, and other metrics.
AI-generated text is hard to validate. You can't assertEqual on free-form output, and eyeballing samples doesn't scale. You need metrics that capture different quality dimensions, with thresholds you can tune per use case.
Veritext provides composable validators for text quality. Pick the metrics you care about, set thresholds, compose them with boolean logic. Use cases range from chatbot output quality to summariser fidelity to content generator consistency. It plugs into existing Python test infrastructure via pytest, so adopting it doesn't mean learning a new framework.
Five metric families: BLEU (n-gram precision against a reference), ROUGE (recall-oriented overlap), lexical similarity (edit distance, Jaccard), readability (Flesch-Kincaid and similar), and semantic similarity (sentence-transformers embeddings, cosine similarity). Each metric is a standalone validator with a configurable threshold.
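To make "standalone validator with a configurable threshold" concrete, here is a minimal pure-Python sketch of the two lexical metrics named above. The function names (jaccard, edit_distance, threshold_validator) are illustrative assumptions, not Veritext's actual API, and tokenisation is a plain whitespace split.

```python
from typing import Callable


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def threshold_validator(metric: Callable[[str, str], float],
                        threshold: float) -> Callable[[str, str], bool]:
    """Wrap a higher-is-better metric as a pass/fail check (hypothetical helper)."""
    def check(output: str, reference: str) -> bool:
        return metric(output, reference) >= threshold
    return check


# Hypothetical usage: require at least 50% token overlap with the reference.
overlap_ok = threshold_validator(jaccard, 0.5)
print(overlap_ok("the cat sat", "the cat ran"))  # Jaccard is 2/4 = 0.5, so this passes
```

The threshold wrapper assumes higher scores are better. A distance metric such as edit_distance would need the comparison inverted, or a conversion to a similarity first.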
Compose validators with all_of (every metric must pass) and any_of (at least one must pass) for complex validation rules. Semantic similarity catches paraphrases that lexical metrics miss entirely, but it's opt-in because the model download is large and inference is slow. Users who only need lexical metrics don't pay for it.
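The composition idea can be sketched in a few lines. The names all_of and any_of come from the description above, but the signatures here are assumptions: each validator is modelled as a plain callable taking (output, reference) and returning a bool, and the three leaf validators are hypothetical stand-ins.

```python
from typing import Callable

Validator = Callable[[str, str], bool]


def all_of(*validators: Validator) -> Validator:
    """Pass only if every wrapped validator passes."""
    def check(output: str, reference: str) -> bool:
        return all(v(output, reference) for v in validators)
    return check


def any_of(*validators: Validator) -> Validator:
    """Pass if at least one wrapped validator passes."""
    def check(output: str, reference: str) -> bool:
        return any(v(output, reference) for v in validators)
    return check


# Hypothetical leaf validators, for illustration only.
def exact_match(output: str, reference: str) -> bool:
    return output.strip() == reference.strip()


def token_overlap_at_least(threshold: float) -> Validator:
    def check(output: str, reference: str) -> bool:
        a, b = set(output.lower().split()), set(reference.lower().split())
        union = a | b
        return (len(a & b) / len(union) if union else 1.0) >= threshold
    return check


def length_ratio_at_most(limit: float) -> Validator:
    def check(output: str, reference: str) -> bool:
        return len(output) <= limit * len(reference)
    return check


# Output must not be much longer than the reference, AND must either
# match exactly or share at least half of its tokens with the reference.
rule = all_of(
    length_ratio_at_most(2.0),
    any_of(exact_match, token_overlap_at_least(0.5)),
)
```

Because all_of and any_of return validators themselves, rules nest to any depth. Both short-circuit, so under this design a slow check such as semantic similarity placed last only runs when the cheaper checks have not already decided the result.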
Ships as a pytest plugin. validate_text() integrates with standard test discovery, so text quality assertions read like ordinary tests. Structured failure messages tell you which metrics failed and by how much, not just 'assertion failed'.
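The sketch below shows what a structured failure message could look like inside an ordinary pytest test. The signature of validate_text() isn't given here, so this uses hypothetical stand-ins (MetricResult, format_failures, assert_text_quality) and hard-coded scores rather than guessing at the real API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricResult:
    """One metric's outcome (hypothetical structure, not the plugin's own)."""
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def format_failures(results: List[MetricResult]) -> str:
    """One line per failing metric, stating the score and the shortfall."""
    return "\n".join(
        f"{r.name}: score {r.score:.2f} < threshold {r.threshold:.2f} "
        f"(short by {r.threshold - r.score:.2f})"
        for r in results
        if not r.passed
    )


def assert_text_quality(results: List[MetricResult]) -> None:
    """Raise AssertionError with a per-metric breakdown if anything failed."""
    message = format_failures(results)
    assert not message, "text validation failed:\n" + message


def test_summary_quality():
    # pytest collects this like any other test function.
    # Scores are hard-coded here; a real test would compute them.
    results = [
        MetricResult("rouge_l", 0.62, 0.50),
        MetricResult("jaccard", 0.55, 0.40),
    ]
    assert_text_quality(results)
```

With a jaccard score of 0.31 against a 0.40 threshold, the assertion message would name jaccard and report a shortfall of 0.09, while saying nothing about the passing rouge_l check.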
The regression detection workflow scores a set of outputs to establish a benchmark, then checks outputs from future versions against it. Quality regressions get caught the same way unit tests catch functional regressions: a failing test in CI.
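A minimal sketch of that workflow, assuming the benchmark is stored as a JSON file mapping metric names to scores and that a small tolerance absorbs noise. The file layout, function names, and the 0.02 default are all illustrative assumptions.

```python
import json
from pathlib import Path
from typing import Dict, Optional, Tuple


def save_baseline(scores: Dict[str, float], path: Path) -> None:
    """Persist benchmark scores so later runs can compare against them."""
    path.write_text(json.dumps(scores, indent=2, sort_keys=True))


def load_baseline(path: Path) -> Dict[str, float]:
    return json.loads(path.read_text())


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return {metric: (baseline, current)} for every metric that dropped
    by more than `tolerance`. A metric missing from `current` counts as a
    regression and is reported with None."""
    regressions: Dict[str, Tuple[float, Optional[float]]] = {}
    for name, base in baseline.items():
        score = current.get(name)
        if score is None or score < base - tolerance:
            regressions[name] = (base, score)
    return regressions


def test_no_quality_regression():
    # In CI the baseline would come from load_baseline(); inlined here.
    baseline = {"rouge_l": 0.50, "jaccard": 0.80}
    current = {"rouge_l": 0.51, "jaccard": 0.79}  # small dip within tolerance
    assert find_regressions(baseline, current) == {}
```

The tolerance is a design choice: too tight and noisy metrics fail builds spuriously, too loose and slow quality drift goes unnoticed.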
A command-line tool runs batch validation against JSONL files, where each line holds one input/reference/output record. It is useful for validating a dataset of outputs before deploying a new model version or after a prompt change. It reports per-metric scores and an aggregate pass/fail, and can be wired into CI pipelines alongside unit tests.
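The batch flow can be sketched as follows. The field names ("input", "reference", "output"), the single token-overlap metric, and the report shape are assumptions for illustration. The actual tool's file schema and flags aren't specified here.

```python
import io
import json
from typing import Any, Dict, Iterable


def token_overlap(output: str, reference: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens."""
    a, b = set(output.lower().split()), set(reference.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def validate_jsonl(lines: Iterable[str], threshold: float = 0.5) -> Dict[str, Any]:
    """Score each JSONL record and report per-record and aggregate pass/fail."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        score = token_overlap(record["output"], record["reference"])
        records.append(
            {"line": line_no, "score": score, "passed": score >= threshold}
        )
    return {"records": records, "passed": all(r["passed"] for r in records)}


# Hypothetical two-record dataset; a real run would pass an open file handle.
sample = io.StringIO(
    '{"input": "Summarise.", "reference": "the cat sat", "output": "the cat sat"}\n'
    '{"input": "Summarise.", "reference": "the cat sat", "output": "dogs bark"}\n'
)
report = validate_jsonl(sample)
print(json.dumps(report, indent=2))
```

In a CI job, a wrapper script could exit non-zero whenever the aggregate "passed" field is false, so a failed batch blocks the deploy the same way a failed unit test would.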