Protean Labs Docs

Entity Extraction

Entity extraction turns local scientific records into structured review signals. It is additive: it augments ingestion and evidence organization, but it does not replace deterministic validators or LLM extraction.

Protean can use the urchade/gliner_large-v2 model when the local GLiNER runtime is available. If the GLiNER package is not installed, deterministic field extraction and the regex fallback remain active.

Entity Families

The extractor is configured for scientific and peptide-adjacent entities:

  • peptide names
  • sequence-like strings
  • assay names
  • proteases
  • organisms
  • route-of-administration terms
  • degradation and stability terms
  • permeability terms
  • toxicity terms
  • failure signals
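
As an illustration, the families above could be expressed as a label list for a zero-shot extractor, with a small helper that tallies extracted entities per family. The label strings and the `count_by_label` helper are hypothetical; the exact labels Protean passes to the model are not given on this page.

```python
from collections import Counter

# Assumed label strings, one per entity family listed above.
ENTITY_LABELS = [
    "peptide name",
    "sequence-like string",
    "assay name",
    "protease",
    "organism",
    "route of administration",
    "degradation or stability term",
    "permeability term",
    "toxicity term",
    "failure signal",
]


def count_by_label(entities: list[dict]) -> dict[str, int]:
    """Count entities per configured label, reporting zero for unseen labels."""
    counts = Counter(e["label"] for e in entities)
    return {label: counts.get(label, 0) for label in ENTITY_LABELS}
```

A per-family tally like this makes it easy to spot a family that never fires on a given corpus, which may point to a label wording problem rather than an absence of such entities.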

Output

data/processed/entities/entities_latest.jsonl

Entity records help the retrieval layer and paper generator identify relevant assay context, protease language, degradation signals, and failure vocabulary.

Limits

Entity extraction can miss entities or over-select terms, and the GLiNER path requires local package support. It is a structuring aid, not a scientific claim engine.