Text Embeddings

Protean uses text embeddings for scientific evidence organization, not peptide sequence similarity. This distinction is deliberate: natural-language records and amino-acid sequences are different data surfaces with different failure modes.

The primary route is BAAI/bge-m3. If it is unavailable, the runtime can fall back to nomic-ai/nomic-embed-text-v1.5, and then to deterministic lexical embeddings for degraded operation.
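
The fallback order above can be sketched as a simple ordered probe. This is a minimal illustration, not Protean's actual loader: the route names come from the text, while `select_route`, the `loaders` mapping, and the `"lexical-fallback"` label are hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Tuple

# An encoder maps one text to one vector (assumed shape for this sketch).
Encoder = Callable[[str], List[float]]

# Route order as described in the text; the third label is a hypothetical
# name for the deterministic lexical fallback.
ROUTE_ORDER = (
    "BAAI/bge-m3",
    "nomic-ai/nomic-embed-text-v1.5",
    "lexical-fallback",
)


def select_route(loaders: Dict[str, Callable[[], Encoder]]) -> Tuple[str, Encoder]:
    """Return (route_name, encoder) for the first route that loads."""
    for name in ROUTE_ORDER:
        loader = loaders.get(name)
        if loader is None:
            continue
        try:
            return name, loader()
        except Exception:
            # Any load failure counts as "unavailable" in this sketch;
            # a real runtime would likely log the reason before moving on.
            continue
    raise RuntimeError("no text-embedding route could be loaded")
```

Returning the route name alongside the encoder lets the caller record which route actually produced a vector, which matters once degraded operation is possible.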

Route Boundary

Text embeddings are used for:

literature and evidence record retrieval
candidate explanation context
generated paper context
failure-record search
local paper and patent record organization when those records exist

They are not used for:

peptide sequence similarity
protein language modeling
novelty scoring over amino-acid strings
deterministic validation
scoring override

ESM remains the protein/peptide sequence embedding route.
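
One way to keep this boundary enforceable is an allow-list check at the entry point of the text route. The sketch below is an assumption about how such a guard could look; the purpose labels and the `require_text_route` function are hypothetical and only mirror the lists above.

```python
# Hypothetical purpose labels mirroring the "used for" list above.
ALLOWED_TEXT_PURPOSES = frozenset({
    "evidence_retrieval",
    "candidate_explanation_context",
    "generated_paper_context",
    "failure_record_search",
    "local_record_organization",
})


def require_text_route(purpose: str) -> str:
    """Return the purpose if the text route may serve it, else raise."""
    if purpose not in ALLOWED_TEXT_PURPOSES:
        # Anything not explicitly allowed is rejected, so sequence work
        # (for example peptide similarity) cannot reach this route by accident.
        raise ValueError(f"text embeddings are not used for: {purpose}")
    return purpose
```

An allow-list fails closed: a new purpose has to be added deliberately before the text route will serve it.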

Cache Behavior

The embedding layer writes a bounded local SQLite cache so repeated retrieval does not require recomputing every text vector.

data/cache/text_embeddings.sqlite
-> bounded row cache
-> local-only model loading
-> deterministic fallback when models are unavailable

The cache is operational infrastructure. It improves speed and reproducibility, but it does not turn retrieved sources into biological proof.
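
A bounded row cache of this kind can be sketched with the standard-library `sqlite3` module. The table layout, the `TextEmbeddingCache` class, the `max_rows` default, and the oldest-first eviction rule are all assumptions made for illustration; the text only states that the cache is local, SQLite-backed, and bounded.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List


class TextEmbeddingCache:
    """Bounded SQLite cache keyed by (model, text); evicts oldest inserts."""

    def __init__(self, path: str, max_rows: int = 10000):
        self.conn = sqlite3.connect(path)
        self.max_rows = max_rows
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings ("
            "key TEXT PRIMARY KEY, vector TEXT NOT NULL, seq INTEGER NOT NULL)"
        )

    @staticmethod
    def _key(model: str, text: str) -> str:
        # Include the model name so vectors from different routes never mix.
        return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()

    def get_or_compute(
        self, model: str, text: str, encode: Callable[[str], List[float]]
    ) -> List[float]:
        key = self._key(model, text)
        row = self.conn.execute(
            "SELECT vector FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])
        vector = list(encode(text))
        with self.conn:  # commit the insert and the eviction together
            next_seq = self.conn.execute(
                "SELECT COALESCE(MAX(seq), 0) + 1 FROM embeddings"
            ).fetchone()[0]
            self.conn.execute(
                "INSERT OR REPLACE INTO embeddings (key, vector, seq) VALUES (?, ?, ?)",
                (key, json.dumps(vector), next_seq),
            )
            # Keep only the newest max_rows rows.
            self.conn.execute(
                "DELETE FROM embeddings WHERE key NOT IN ("
                "SELECT key FROM embeddings ORDER BY seq DESC LIMIT ?)",
                (self.max_rows,),
            )
        return vector

    def row_count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0]
```

In this sketch the cache would be opened with the path shown above, for example `TextEmbeddingCache("data/cache/text_embeddings.sqlite")`, and `":memory:"` works for tests.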

Degraded Mode

If both BGE and Nomic are unavailable, the platform falls back to deterministic text hashing and lexical overlap. This fallback gives lower retrieval quality, but it is stable, fully local, and marked explicitly in the artifacts it produces.
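
Deterministic text hashing and lexical overlap can be sketched with the standard library alone. The functions below are an illustrative assumption, not the platform's implementation: the tokenizer, the 256-dimension default, the signed hashing trick, and the Jaccard form of overlap are choices made for this example.

```python
import hashlib
import math
import re
from typing import List

_TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return _TOKEN.findall(text.lower())


def hashed_embedding(text: str, dim: int = 256) -> List[float]:
    """Map text to a fixed-size vector by hashing each token to a bucket."""
    vec = [0.0] * dim
    for tok in tokenize(text):
        # sha256 is used instead of the built-in hash(), which is salted
        # per process and would break reproducibility across runs.
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[idx] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def lexical_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two texts."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Because both functions depend only on the input text, the same record yields the same vector and score on every run and every machine, which is the property that makes the degraded mode stable.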
