Text Embeddings
Protean uses text embeddings to organize scientific evidence, not to measure peptide sequence similarity. This distinction is deliberate: natural-language records and amino-acid sequences are different data surfaces with different failure modes.
The primary route is BAAI/bge-m3. If it is unavailable, the runtime can fall back to nomic-ai/nomic-embed-text-v1.5, and then to deterministic lexical embeddings for degraded operation.
Route Boundary
Text embeddings are used for:
- literature and evidence record retrieval
- candidate explanation context
- generated paper context
- failure-record search
- local paper and patent record organization when those records exist
They are not used for:
- peptide sequence similarity
- protein language modeling
- novelty scoring over amino-acid strings
- deterministic validation
- scoring override
ESM remains the protein/peptide sequence embedding route.
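One way to keep this boundary from eroding is an explicit allowlist checked before any text is embedded. The sketch below assumes callers declare a purpose string; the identifiers and the `check_text_embedding_use` function are hypothetical and only mirror the two lists above.

```python
# Purposes mirror the lists above; the identifiers themselves are hypothetical.
ALLOWED_TEXT_PURPOSES = frozenset({
    "evidence_retrieval",
    "candidate_explanation_context",
    "generated_paper_context",
    "failure_record_search",
    "paper_patent_organization",
})

EXCLUDED_PURPOSES = frozenset({
    "peptide_sequence_similarity",
    "protein_language_modeling",
    "amino_acid_novelty_scoring",
    "deterministic_validation",
    "scoring_override",
})


def check_text_embedding_use(purpose: str) -> None:
    """Raise ValueError unless the purpose is on the text-embedding allowlist."""
    if purpose in EXCLUDED_PURPOSES:
        raise ValueError(f"{purpose!r} must not use the text embedding route")
    if purpose not in ALLOWED_TEXT_PURPOSES:
        raise ValueError(f"unknown purpose {purpose!r}")


check_text_embedding_use("failure_record_search")  # passes silently
```

Unknown purposes are rejected as well, so a new use has to be added to one list or the other deliberately.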
Cache Behavior
The embedding layer writes a bounded local SQLite cache so repeated retrieval does not require recomputing every text vector.
data/cache/text_embeddings.sqlite
-> bounded row cache
-> local-only model loading
-> deterministic fallback when models are unavailable
The cache is operational infrastructure. It improves speed and reproducibility, but it does not turn retrieved sources into biological proof.
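A bounded row cache of this kind can be sketched with the standard-library `sqlite3` module. The table layout, the SHA-256 cache key, the JSON vector encoding, and the oldest-first eviction policy are all assumptions made for illustration; the source only states that the cache is local, SQLite-backed, and bounded.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List


class BoundedEmbeddingCache:
    """Hypothetical sketch: SQLite-backed vector cache with a row limit."""

    def __init__(self, path: str = ":memory:", max_rows: int = 1000) -> None:
        self.max_rows = max_rows
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " key TEXT UNIQUE NOT NULL,"
            " vector TEXT NOT NULL)"
        )

    @staticmethod
    def _key(model: str, text: str) -> str:
        # Key on model and text so vectors from different routes never mix.
        return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()

    def get_or_compute(
        self, model: str, text: str, compute: Callable[[str], List[float]]
    ) -> List[float]:
        key = self._key(model, text)
        row = self.conn.execute(
            "SELECT vector FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: no recomputation
        vector = compute(text)
        self.conn.execute(
            "INSERT INTO embeddings (key, vector) VALUES (?, ?)",
            (key, json.dumps(vector)),
        )
        # Enforce the bound by dropping the oldest rows (insertion order).
        self.conn.execute(
            "DELETE FROM embeddings WHERE id NOT IN ("
            " SELECT id FROM embeddings ORDER BY id DESC LIMIT ?)",
            (self.max_rows,),
        )
        self.conn.commit()
        return vector

    def __len__(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0]
```

In this sketch, passing `data/cache/text_embeddings.sqlite` as `path` would persist the cache on disk, while the default `:memory:` keeps it ephemeral for tests.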
Degraded Mode
If both BAAI/bge-m3 and nomic-ai/nomic-embed-text-v1.5 are unavailable, the platform uses deterministic text hashing and lexical overlap. That fallback is lower quality, but it is stable, local, and explicit in artifacts.
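A deterministic lexical embedding can be approximated with feature hashing: each token is hashed into a fixed-size vector, so texts that share words land close together under cosine similarity. This is a sketch of the general technique, not Protean's fallback code; the function names, the tokenizer, and the dimension of 256 are assumptions. It uses `hashlib` rather than Python's built-in `hash()`, which is salted per process and therefore not reproducible across runs.

```python
import hashlib
import math
import re
from typing import List


def lexical_embedding(text: str, dim: int = 256) -> List[float]:
    """Hash lowercase alphanumeric tokens into an L2-normalized count vector."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Text with no tokens stays an all-zero vector.
    return [v / norm for v in vec] if norm else vec


def cosine(a: List[float], b: List[float]) -> float:
    # Inputs are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


query = lexical_embedding("peptide binding assay failed")
related = lexical_embedding("binding assay failed for peptide")
unrelated = lexical_embedding("patent filing date")
```

Here `cosine(query, related)` should exceed `cosine(query, unrelated)` because the first pair shares four tokens, which is all the signal a lexical fallback has: it captures word overlap, not meaning.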
