Ingestion Method

Protean Labs treats ingestion as the first control surface in autonomous discovery. The goal is not to collect the largest possible corpus. The goal is to convert selected scientific records into traceable evidence that can safely influence constraints, candidate generation, ranking, and review.

The ingestion layer is designed around provenance, normalization, deduplication, and negative evidence capture. Together these give downstream stages a consistent, traceable evidence base before any candidate is allowed to advance.

Source Classes

Protean’s public source categories include curated biological sequence records, biomedical literature, open publication metadata, and clinical-stage peptide context.

The system currently supports source classes such as:

UniProt-style sequence and annotation records.
PubMed and NCBI literature metadata.
Europe PMC abstracts and publication availability metadata.
ClinicalTrials.gov study metadata for peptide-related context.
Internal research observations and failure records.

These sources are not treated as equivalent truth. Each record enters with source identity, query context, timestamps, lineage, and extraction confidence so downstream systems can reason about provenance instead of flattening everything into one undifferentiated text pool.
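As an illustration only, the provenance fields listed above could travel with each record in a small immutable envelope. The class name, field names, and validation rules below are assumptions made for this sketch and are not Protean's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical provenance envelope attached to every ingested record."""

    source_type: str             # e.g. "sequence", "literature", "trial" (assumed labels)
    source_id: str               # identifier assigned by the originating source
    query_context: str           # the query or acquisition context that retrieved it
    retrieved_at: datetime       # retrieval timestamp
    lineage: tuple[str, ...]     # ordered steps that produced this record
    extraction_confidence: float # 0.0 to 1.0, set by the extraction step
    raw_payload: str             # untouched source text

    def __post_init__(self) -> None:
        # Reject records that could not be traced or weighed downstream.
        if not self.source_id:
            raise ValueError("source_id is required for traceability")
        if not 0.0 <= self.extraction_confidence <= 1.0:
            raise ValueError("extraction_confidence must be within [0, 1]")


record = SourceRecord(
    source_type="literature",
    source_id="example-0001",  # placeholder, not a real identifier
    query_context="oral peptide stability",
    retrieved_at=datetime.now(timezone.utc),
    lineage=("query:example", "retrieval:batch-1"),
    extraction_confidence=0.8,
    raw_payload="Example abstract text.",
)
```

Freezing the dataclass means a record's provenance cannot be edited after ingestion; a changed record would have to enter as a new one.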

Evidence Flow

source selection
-> rate-limited retrieval
-> raw record preservation
-> dedupe and fingerprinting
-> structured extraction
-> evidence normalization
-> negative-signal capture
-> constraint and ranking inputs

Raw records are preserved before extraction. Normalized evidence is then shaped into a fixed internal contract that can be inspected by ranking, explanation, and failure-memory systems.
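The ordering described above, preserving the raw record before any extraction runs, can be sketched as follows. The in-memory store, the JSON payload shape, and the function names are assumptions chosen for the example; a real deployment would presumably use durable storage.

```python
import hashlib
import json

# Hypothetical in-memory raw store, keyed by content hash.
raw_store: dict[str, str] = {}


def preserve_raw(payload: str) -> str:
    """Store the untouched payload and return the key that points back to it."""
    key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    raw_store.setdefault(key, payload)
    return key


def normalize(payload: str, source_type: str) -> dict:
    """Preserve first, then parse into a fixed set of fields."""
    raw_key = preserve_raw(payload)  # happens before any parsing can fail
    data = json.loads(payload)       # raises ValueError on malformed input
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return {
        "source_type": source_type,
        "source_id": str(data.get("id", "")),
        "title": str(data.get("title", "")).strip(),
        "raw_key": raw_key,          # traceability back to the raw record
    }


payload = '{"id": "example-0001", "title": "  Example title  "}'
normalized = normalize(payload, "literature")
```

Because preservation runs first, a payload that fails to parse is still retained and can be re-extracted later.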

Normalization Contract

Each normalized record is shaped to preserve the properties that matter for research orchestration:

Source type and source identifier.
Query or acquisition context.
Candidate sequence mentions when conservatively detected.
Biological or formulation context when available.
Positive, neutral, and negative evidence language.
Failure cues such as degradation, cleavage, instability, poor solubility, low permeability, short half-life, terminated development, or weak translation signals.
Traceability back to the originating record.

The extraction layer is intentionally conservative. If a sequence or claim cannot be parsed with sufficient confidence, it is kept as context rather than promoted into candidate evidence.
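A minimal sketch of that conservative routing rule: a mention is promoted into candidate evidence only above a confidence threshold, and otherwise kept as context. The threshold value, class, and field names are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field

# Assumed cutoff for this sketch; the real threshold is not documented here.
PROMOTION_THRESHOLD = 0.9


@dataclass
class NormalizedRecord:
    source_type: str
    source_id: str
    query_context: str
    candidate_sequences: list[str] = field(default_factory=list)
    context_notes: list[str] = field(default_factory=list)


def route_mention(record: NormalizedRecord, mention: str, confidence: float) -> str:
    """Promote a mention only when confidence is high; otherwise keep it as context."""
    if confidence >= PROMOTION_THRESHOLD:
        record.candidate_sequences.append(mention)
        return "candidate"
    record.context_notes.append(mention)
    return "context"


rec = NormalizedRecord("literature", "example-0001", "oral peptide stability")
first = route_mention(rec, "EXAMPLESEQ-A", 0.95)   # placeholder mention, promoted
second = route_mention(rec, "EXAMPLESEQ-B", 0.40)  # placeholder mention, kept as context
```

Nothing is discarded in either branch, so a low-confidence mention stays available for later review without influencing candidate evidence.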

Negative Evidence Capture

Failure data is not a side note. It becomes part of the optimization surface.

During ingestion, Protean looks for structured signs of peptide failure: proteolytic degradation, gastric or intestinal instability, low oral exposure, poor permeability, poor solubility, immunogenicity signals, short half-life, delivery limitations, and abandoned or terminated development paths.
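One simple way to approximate this kind of cue detection is phrase matching against a fixed vocabulary. The cue labels and patterns below are assumptions for illustration, not Protean's actual rules, and plain matching like this does not handle negation (for example "no proteolytic degradation observed").

```python
import re

# Hypothetical cue vocabulary; labels and patterns are illustrative only.
FAILURE_CUES = {
    "proteolytic_degradation": r"proteolytic degradation|proteolysis",
    "poor_solubility": r"poor(?:ly)? solub\w+|low solubility",
    "short_half_life": r"short half-life",
    "poor_permeability": r"(?:poor|low) permeability",
    "terminated_development": r"\b(?:terminated|discontinued|abandoned)\b",
}

_COMPILED = {
    label: re.compile(pattern, re.IGNORECASE)
    for label, pattern in FAILURE_CUES.items()
}


def detect_failure_cues(text: str) -> list[str]:
    """Return the labels of every cue whose pattern appears in the text."""
    return [label for label, pattern in _COMPILED.items() if pattern.search(text)]


cues = detect_failure_cues(
    "The peptide showed poor solubility and a short half-life in plasma."
)
# cues == ["poor_solubility", "short_half_life"]
```

Returning labels rather than scores keeps the output inspectable; any weighting would be left to downstream ranking.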

Those records can influence:

Failure motif memory.
Candidate warnings.
Ranking penalties.
Constraint refinement.
Explanation context.
Bounded learning signals.

Scientific Posture

Ingested evidence is not treated as validation. It is structured context for prioritization and review.

Dedupe And Provenance

Ingestion uses source identifiers, record keys, and fingerprints to avoid repeatedly counting the same source material. This matters because autonomous loops can otherwise inflate the influence of repeated records and create false confidence.

Protean keeps dedupe as a first-class ingestion concern:

Previously seen records are recognized.
Refreshed records can be processed intentionally.
Duplicate mentions do not automatically become stronger evidence.
Source lineage remains visible to review systems.
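The dedupe behavior above can be sketched with a content fingerprint and a registry of seen records. This is an illustration under stated assumptions: SHA-256 as the fingerprint, whitespace and case folding as the only normalization, and "refreshed" meaning the same source identifier with changed content. None of these choices are confirmed details of Protean's system.

```python
import hashlib


def fingerprint(source_type: str, source_id: str, content: str) -> str:
    """Hash the source identity plus whitespace- and case-normalized content."""
    normalized = " ".join(content.lower().split())
    digest = hashlib.sha256()
    for part in (source_type, source_id, normalized):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent fields cannot run together
    return digest.hexdigest()


class SeenRegistry:
    """Hypothetical registry that classifies incoming records."""

    def __init__(self) -> None:
        self._latest: dict[tuple[str, str], str] = {}

    def classify(self, source_type: str, source_id: str, content: str) -> str:
        key = (source_type, source_id)
        fp = fingerprint(source_type, source_id, content)
        previous = self._latest.get(key)
        if previous is None:
            self._latest[key] = fp
            return "new"
        if previous == fp:
            return "duplicate"   # seen before; should not add evidential weight
        self._latest[key] = fp
        return "refreshed"       # same source record, changed content


registry = SeenRegistry()
a = registry.classify("literature", "example-0001", "Peptide X degrades rapidly.")
b = registry.classify("literature", "example-0001", "peptide x  degrades rapidly.")
c = registry.classify("literature", "example-0001", "Peptide X degrades slowly.")
```

Here `a`, `b`, and `c` come out as "new", "duplicate", and "refreshed", so a caller can decide explicitly whether a refreshed record is reprocessed.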

Model-Assisted Extraction

Local reasoning models may assist with extraction when available, but they do not become the source of truth. Model-assisted extraction is bounded by schema, timeout, routing, and deterministic fallback behavior.

If model extraction is unavailable or uncertain, the system continues through conservative rule-based extraction. This protects the evidence layer from becoming dependent on a single generative route.
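The bounded-with-fallback pattern can be sketched as below. The schema keys, the timeout value, and the `model_fn` callable are assumptions for this example; no specific model runtime or API is implied.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed output schema for this sketch.
REQUIRED_KEYS = {"claim", "polarity"}
ALLOWED_POLARITY = {"positive", "neutral", "negative"}


def rule_based_extract(text: str) -> dict:
    """Deterministic fallback: keep the text as a neutral claim."""
    return {"claim": text.strip()[:200], "polarity": "neutral", "route": "rules"}


def is_valid(result: object) -> bool:
    """Accept only dicts that carry the required keys and an allowed polarity."""
    return (
        isinstance(result, dict)
        and REQUIRED_KEYS.issubset(result)
        and result["polarity"] in ALLOWED_POLARITY
    )


def extract(text: str, model_fn=None, timeout_s: float = 2.0) -> dict:
    """Try the model route within a time budget; otherwise fall back to rules."""
    if model_fn is not None:
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            result = pool.submit(model_fn, text).result(timeout=timeout_s)
            if is_valid(result):
                return {**result, "route": "model"}
        except Exception:
            pass  # timeout or model error: fall through to the rule-based route
        finally:
            # Do not block on a slow call; its thread finishes in the background.
            pool.shutdown(wait=False, cancel_futures=True)
    return rule_based_extract(text)


def stub_model(text: str) -> dict:
    # Stand-in for a local model call (hypothetical).
    return {"claim": text, "polarity": "negative"}


ok = extract("Peptide X showed poor permeability.", model_fn=stub_model)
fallback = extract("Peptide X showed poor permeability.")
```

Tagging each result with its route keeps it visible downstream whether a claim came from the model or from rules. Note that a thread-based timeout stops waiting for a slow call but does not terminate it.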

Downstream Use

The ingestion layer feeds the rest of the platform in four ways:

Constraint generation receives evidence-derived design boundaries.
Candidate generation receives stable examples, failure examples, and sequence context.
Ranking receives failure proximity, novelty, and review signals.
Explanations receive source-aware context for why a candidate advanced or stalled.

This is the public shape of the ingestion system. Protean’s proprietary advantage lives in how the organization curates, weights, and operationalizes these signals over time.
