Entity Extraction
Entity extraction turns local scientific records into structured review signals. It is additive: it augments ingestion and evidence organization, but it does not replace deterministic validators or LLM extraction.
Protean can use the urchade/gliner_large-v2 model when the GLiNER package is installed locally. If the package is unavailable, deterministic field extraction and the regex fallback remain active.
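The optional-dependency behavior described above can be sketched as follows. This is a minimal illustration, not Protean's actual code: the function names, the record fields, and the sequence regex are assumptions made for the example. The only GLiNER calls used are `GLiNER.from_pretrained` and `predict_entities`.

```python
import importlib.util
import re

MODEL_NAME = "urchade/gliner_large-v2"

# Hypothetical fallback pattern: runs of 5+ one-letter amino-acid codes.
SEQUENCE_RE = re.compile(r"\b[ACDEFGHIKLMNPQRSTVWY]{5,}\b")


def load_model():
    """Return a GLiNER model, or None when the package is not installed."""
    if importlib.util.find_spec("gliner") is None:
        return None
    from gliner import GLiNER  # imported lazily so the fallback path never needs it
    return GLiNER.from_pretrained(MODEL_NAME)


def regex_fallback(text):
    """Deterministic extraction that works without any model."""
    return [
        {
            "text": m.group(0),
            "label": "sequence-like string",
            "start": m.start(),
            "end": m.end(),
            "source": "regex",
        }
        for m in SEQUENCE_RE.finditer(text)
    ]


def extract_entities(text, model=None, labels=("sequence-like string",)):
    """Use the model when one is supplied, otherwise fall back to regex."""
    if model is None:
        return regex_fallback(text)
    found = model.predict_entities(text, list(labels))
    return [dict(entity, source="gliner") for entity in found]
```

Calling `extract_entities(text, model=load_model())` picks the model path only when the package is present. Note that a bare pattern like the one above also matches uppercase acronyms built from the same letters (for example ELISA), which is one way a regex fallback over-selects.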
Entity Families
The extractor is configured for scientific and peptide-adjacent entities:
- peptide names
- sequence-like strings
- assay names
- proteases
- organisms
- route of administration terms
- degradation and stability terms
- permeability terms
- toxicity terms
- failure signals
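As a rough sketch of how the families above could be passed to a GLiNER model as labels: the exact label strings Protean uses are not given here, so the singular forms below are assumptions, as are the helper names. `predict_entities(text, labels, threshold=...)` is the GLiNER call; it returns dictionaries that include `text` and `label` keys.

```python
from collections import defaultdict

# Assumed label strings, derived from the entity families listed above.
ENTITY_LABELS = [
    "peptide name",
    "sequence-like string",
    "assay name",
    "protease",
    "organism",
    "route of administration",
    "degradation or stability term",
    "permeability term",
    "toxicity term",
    "failure signal",
]


def predict(model, text, threshold=0.5):
    """Run a GLiNER-style model over one text with the configured labels."""
    return model.predict_entities(text, ENTITY_LABELS, threshold=threshold)


def group_by_label(entities):
    """Collect entity texts under their label, preserving first-seen order."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["text"] not in grouped[entity["label"]]:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)
```

Grouping by label keeps the output easy to scan during review, and the threshold parameter is the main lever for trading missed entities against over-selection.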
Output
Entity records are written to data/processed/entities/entities_latest.jsonl, one JSON record per line.
Entity records help the retrieval layer and paper generator identify relevant assay context, protease language, degradation signals, and failure vocabulary.
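A downstream consumer might read the output file along these lines. The record schema is not documented here, so the `label` and `text` keys are assumptions for illustration, as are the function names; the sketch also assumes the file holds one JSON object per line, as the .jsonl extension suggests.

```python
import json
from pathlib import Path

ENTITIES_PATH = Path("data/processed/entities/entities_latest.jsonl")


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def read_entities(path=ENTITIES_PATH):
    """Load all entity records from the output file."""
    with Path(path).open(encoding="utf-8") as handle:
        return parse_jsonl(handle)


def filter_by_label(records, wanted):
    """Keep only records whose (assumed) 'label' key is in `wanted`."""
    wanted = set(wanted)
    return [record for record in records if record.get("label") in wanted]
```

For example, `filter_by_label(read_entities(), {"protease", "assay name"})` would narrow the records to protease and assay context before retrieval.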
Limits
Entity extraction can miss entities or over-select terms, and the GLiNER path depends on a locally installed package. It is a structuring aid, not a scientific claim engine.
