Text Embeddings

Protean uses text embeddings for scientific evidence organization, not peptide sequence similarity. This distinction is deliberate: natural-language records and amino-acid sequences are different data surfaces with different failure modes.

The primary route is BAAI/bge-m3. If it is unavailable, the runtime can fall back to nomic-ai/nomic-embed-text-v1.5, and then to deterministic lexical embeddings for degraded operation.
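The fallback order above can be sketched as a loop over candidate routes. This is a minimal illustration, not Protean's actual runtime: the `select_route` function, the `Embedder` type, and the two stub loaders are hypothetical names introduced here, and real loaders would wrap whatever model-loading code the platform uses.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: an embedder maps one text to one vector.
Embedder = Callable[[str], List[float]]
Loader = Callable[[], Embedder]


def select_route(loaders: Sequence[Tuple[str, Loader]]) -> Tuple[str, Embedder]:
    """Return (route_name, embedder) for the first loader that succeeds.

    Loaders are tried in the order given, so the caller encodes the
    priority: primary model, secondary model, then the lexical fallback.
    """
    errors = []
    for name, load in loaders:
        try:
            return name, load()
        except Exception as exc:  # e.g. model files not present locally
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no embedding route available: " + "; ".join(errors))


# Hypothetical stand-ins for real loaders, used only for illustration.
def unavailable_loader() -> Embedder:
    raise OSError("model not available locally")


def lexical_stub_loader() -> Embedder:
    return lambda text: [float(len(text.split()))]


route, embed = select_route([
    ("BAAI/bge-m3", unavailable_loader),
    ("nomic-ai/nomic-embed-text-v1.5", unavailable_loader),
    ("lexical-fallback", lexical_stub_loader),
])
```

Returning the route name alongside the embedder lets the caller record which route produced a given vector.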

Route Boundary

Text embeddings are used for:

  • literature and evidence record retrieval
  • candidate explanation context
  • generated paper context
  • failure-record search
  • local paper and patent record organization when those records exist

They are not used for:

  • peptide sequence similarity
  • protein language modeling
  • novelty scoring over amino-acid strings
  • deterministic validation
  • scoring override

ESM remains the protein/peptide sequence embedding route.

Cache Behavior

The embedding layer writes a bounded local SQLite cache so repeated retrieval does not require recomputing every text vector.

data/cache/text_embeddings.sqlite
-> bounded row cache
-> local-only model loading
-> deterministic fallback when models are unavailable

The cache is operational infrastructure. It improves speed and reproducibility, but it does not turn retrieved sources into biological proof.
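A bounded row cache of this kind could look roughly like the sketch below. The table layout, the `EmbeddingCache` class, the `max_rows` parameter, and the least-recently-used eviction rule are all assumptions made for illustration; the documentation only states that the cache is SQLite-backed and bounded.

```python
import hashlib
import json
import sqlite3
from typing import Callable, List


class EmbeddingCache:
    """Hypothetical bounded SQLite cache keyed by (model, text)."""

    def __init__(self, path: str = ":memory:", max_rows: int = 10000):
        self.conn = sqlite3.connect(path)
        self.max_rows = max_rows
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings ("
            "key TEXT PRIMARY KEY, vector TEXT NOT NULL, used_at INTEGER NOT NULL)"
        )
        # Resume the usage counter so recency ordering survives restarts.
        row = self.conn.execute(
            "SELECT COALESCE(MAX(used_at), 0) FROM embeddings"
        ).fetchone()
        self._tick = row[0]

    @staticmethod
    def _key(model: str, text: str) -> str:
        # Include the model name so vectors from different routes never mix.
        return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()

    def get_or_compute(
        self, model: str, text: str, embed: Callable[[str], List[float]]
    ) -> List[float]:
        key = self._key(model, text)
        self._tick += 1
        row = self.conn.execute(
            "SELECT vector FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            # Cache hit: refresh recency and skip recomputation.
            self.conn.execute(
                "UPDATE embeddings SET used_at = ? WHERE key = ?", (self._tick, key)
            )
            self.conn.commit()
            return json.loads(row[0])
        vector = [float(x) for x in embed(text)]
        self.conn.execute(
            "INSERT INTO embeddings (key, vector, used_at) VALUES (?, ?, ?)",
            (key, json.dumps(vector), self._tick),
        )
        # Enforce the bound: keep only the max_rows most recently used rows.
        self.conn.execute(
            "DELETE FROM embeddings WHERE key NOT IN ("
            "SELECT key FROM embeddings ORDER BY used_at DESC LIMIT ?)",
            (self.max_rows,),
        )
        self.conn.commit()
        return vector

    def __len__(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0]
```

Passing the path of the cache file instead of `":memory:"` would persist vectors across runs. Keying on the model name as well as the text means a fallback vector is never returned in place of a primary-route vector.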

Degraded Mode

If neither BGE nor Nomic is available, the platform falls back to deterministic text hashing and lexical overlap. That fallback is lower quality, but it is stable, local, and explicit in artifacts.
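One common way to build such a fallback is feature hashing plus Jaccard overlap, sketched below. The tokenizer, the vector dimension, and the function names `hashed_embedding` and `lexical_overlap` are assumptions for illustration; the platform's actual hashing scheme is not specified here. The sketch uses `hashlib` rather than Python's built-in `hash()`, because the built-in is salted per process and would not be deterministic across runs.

```python
import hashlib
import math
import re
from typing import List

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def hashed_embedding(text: str, dim: int = 256) -> List[float]:
    """Deterministic bag-of-words vector built by hashing each token."""
    vec = [0.0] * dim
    for token in tokenize(text):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        # A hash-derived sign reduces the bias introduced by collisions.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    # Unit-normalize so a dot product behaves like cosine similarity.
    return [x / norm for x in vec] if norm else vec


def lexical_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two texts."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

The same input always yields the same vector on any machine, which is what makes this route stable, though it captures only shared vocabulary and not meaning.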