Aikium Inc. · Nature Communications 2026 (submitted)

Aiki-XP

Leakage-controlled multimodal prediction of within-species relative protein expression at pan-bacterial scale.

ρ_nc = 0.592 · 492,026 genes · 385 species · 1,831 host genomes · 5 deployment tiers

Try Aiki-XP

Click one of the three example buttons to pre-fill the form with a different input style. Each represents a common user scenario.

Example by input type:

This form runs live inference. The paper's 244,002 held-out 5-fold CV predictions for every gene in the training corpus are on Zenodo and reproducible via the Colab quickstart.

Protein sequence

Coding DNA sequence (CDS)

Host organism

Currently: Escherichia coli K-12 (NC_000913.3)

If your organism isn't listed, type an NCBI accession in the search box (e.g. NC_000913.3); when nothing matches, the page offers to download it on demand.

Expression context

Native gene

Heterologous

Native uses your CDS's real operon in the chosen host; heterologous wraps your CDS in the host's lacZ promoter-CDS-terminator context.

Prediction tier

Tier A · ρ_nc=0.518

Tier D (XP5) · ρ_nc=0.592

Tier D uses all five modalities, computed from three inputs (protein + CDS + host genome).

Your prediction

The triangle marks your gene on the host's per-species expression distribution. Genes to the right are predicted high expressers for that host.

Tier · Host · Mode · Operon · Operon length (nt)

Gene	Predicted z-score	Host percentile	Interpretation

492,026 genes across 385 bacterial species
ρ_nc = 0.592 on non-conserved genes (gene-operon holdout, 5-fold CV)
5 deployment tiers: protein → operon → genome
1,831 host genomes available as operon context

How Aiki-XP works

Three steps from a protein sequence to a ranked prediction.

Submit a protein + its CDS

Paste the amino-acid sequence and the coding-DNA sequence that produces it. Pick the bacterial host you plan to express in (any of 1,831 reference genomes).

Extract five modalities

Aiki-XP places your CDS in its genomic neighborhood (native gene, or wrapped in the host's lacZ context for heterologous expression), then extracts five biological modalities: genome context (Bacformer-large), operon architecture (Evo-2 7B), CDS composition (HyenaDNA), protein identity (ESM-C), and per-gene biophysical features.

Rank within the proteome

A 25M-parameter fusion head returns a per-species z-score: where your gene sits on the host's expression distribution. A score above the host median suggests a favorable expression candidate; a low score suggests an unfavorable one.

What this predicts. Aiki-XP ranks expression candidates within a host's proteome. It reports whether your gene is a high, medium, or low expresser relative to other genes in the selected host. It does not predict absolute protein yield in µg/mL. For choosing among multiple candidates in a given host, rank is the right signal; for absolute yield, wet-lab screening remains necessary.
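To make the z-score and host-percentile outputs concrete, here is a minimal Python sketch. It assumes the host percentile is the empirical share of host genes with a lower z-score; the page does not state the exact formula. The function names, toy expression values, and candidate scores are illustrative and not part of Aiki-XP.

```python
from bisect import bisect_left
from statistics import mean, pstdev

def per_species_z(values):
    """Standardize one species' expression values to mean 0, SD 1."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def host_percentile(z, host_z_sorted):
    """Percent of host genes with a z-score strictly below z (assumed definition)."""
    return 100.0 * bisect_left(host_z_sorted, z) / len(host_z_sorted)

# Toy expression values for one host (made-up numbers).
host_expression = [1.0, 2.0, 3.0, 4.0, 5.0]
host_z = sorted(per_species_z(host_expression))

# Hypothetical predicted z-scores for three candidates in that host.
candidates = {"cand_A": 1.0, "cand_B": -0.5, "cand_C": 0.2}

# Rank order is the signal to act on; percentiles give context within the host.
ranked = sorted(candidates, key=candidates.get, reverse=True)
percentiles = {name: host_percentile(z, host_z) for name, z in candidates.items()}
```

Here `ranked` orders the candidates from most to least favorable, which is the comparison the tool is built for; the percentile only locates each candidate within the host's distribution.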

Where Aiki-XP works, and where it doesn't

Known limits of this release.

✓ Works well for

Ranking native bacterial proteins within their own host genome (Spearman ρ_nc=0.59).
Heterologous candidates where the target host is one of our 1,831 cached bacterial genomes.
Sequences in the same length range as the training set (~50-2000 aa); works best at ~150-800 aa.
Comparisons among candidates: absolute z is noisy, but rank order within a design library is the signal the paper evaluates.

⚠ Use with care for

Hosts outside the cached 1,831 genomes. Pick the closest phylogenetic neighbour, or submit the NCBI accession and the typeahead will add it.
Point mutants of a known protein. The tier ladder is designed for rank-order over diverse sequences, not fine-grained effect sizes of single substitutions.
Proteins with unusual features (membrane-spanning regions, disulfide-heavy, heavily glycosylated in the native context). Training coverage is thin for these classes.
Individual z-scores read as calibrated absolute numbers. Expect MAE ≈ 0.47 per non-conserved gene, 95% of predictions within |Δ|<1.5.

× Will not work for

Eukaryotic hosts (yeast, CHO, plant, mammalian). This model is pan-bacterial only; eukaryotic hosts are out of distribution and would need a model retrained on eukaryotic corpora.
Absolute expression prediction in µg/mL. The model predicts per-species rank, not yield; use wet-lab measurement for absolute values.
Tagged sequences (His6/His10, HiBiT, FLAG, GST, SUMO) scored with the tag intact. Strip tags first (aikixp.sequence_normalization). The model is trained on untagged native proteins.
Cell-free or synthetic expression systems. Aiki-XP is trained on in-vivo PaxDb and Abele atlases; reactions in the PURE system or yeast-lysate TXTL are out of distribution.
Toxic proteins, or ones the host cannot fold. The model is a rank-order predictor of expression potential; it does not account for post-translational degradation, aggregation, or host death.
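As an illustration of the tag-stripping and length checks mentioned in the lists above, the sketch below removes a terminal polyhistidine run and tests the quoted length range. It is not the aikixp.sequence_normalization module, whose interface is not shown on this page. The function names, the regular expressions, and the minimum run of six histidines are assumptions; other tags (HiBiT, FLAG, GST, SUMO), linkers, and protease sites would need their own handling.

```python
import re

# Assumed patterns: an optional start Met followed by 6+ His at the N-terminus,
# or 6+ His at the C-terminus.
N_TERM_HIS = re.compile(r"^(M?)H{6,}")
C_TERM_HIS = re.compile(r"H{6,}$")

def strip_his_tag(protein: str) -> str:
    """Return the sequence with terminal polyhistidine runs removed."""
    seq = protein.strip().upper()
    seq = N_TERM_HIS.sub(r"\1", seq)  # keep the start Met if present
    seq = C_TERM_HIS.sub("", seq)
    return seq

def in_training_length_range(protein: str, lo: int = 50, hi: int = 2000) -> bool:
    """Check the ~50-2000 aa range quoted in the list above."""
    return lo <= len(protein) <= hi

cleaned = strip_his_tag("MHHHHHHSKGEELF")  # toy sequence, too short to score
```

If the protein is trimmed this way, the matching codons would also have to be removed from the CDS so that the two inputs still correspond.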

Compute footprint. Warm per-request latency: Tier A 15 s (CPU), Tier D 5 to 10 s (A100). Cold-start 30 to 120 s after ~5 min idle. Approx cost: $0.001 per request for Tier A, $0.004 to $0.006 for Tier D. Paper training totals: XP5 360 A100-h (5 folds × 72 h) plus embedding extraction ~1000 A100-h across Evo-2, Bacformer, ESM-C, ProtT5, and HyenaDNA. Full traces on Zenodo.
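The figures above translate into a rough budget for scoring a design library. The sketch below is back-of-envelope arithmetic only: it assumes warm, sequential requests with no cold starts or batching, and the dictionary and function names are illustrative.

```python
# Figures quoted above; Tier D is given as a range, so both bounds are kept.
COST_PER_REQUEST_USD = {"A": (0.001, 0.001), "D": (0.004, 0.006)}
WARM_LATENCY_S = {"A": (15, 15), "D": (5, 10)}

def batch_estimate(n_requests: int, tier: str):
    """Return ((min_usd, max_usd), (min_s, max_s)) for sequential warm requests."""
    c_lo, c_hi = COST_PER_REQUEST_USD[tier]
    t_lo, t_hi = WARM_LATENCY_S[tier]
    cost = (n_requests * c_lo, n_requests * c_hi)
    seconds = (n_requests * t_lo, n_requests * t_hi)
    return cost, seconds

# Training total quoted above: 5 folds x 72 h each.
xp5_training_a100_h = 5 * 72

# A 1,000-candidate library scored at Tier D.
cost_d, seconds_d = batch_estimate(1000, "D")
```

Under these assumptions a 1,000-candidate Tier D run costs roughly $4 to $6 and takes about 1.4 to 2.8 hours of warm compute.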

Reproducibility. The paper's 244,002 held-out 5-fold CV predictions are on Zenodo (10.5281/zenodo.19639621). To reproduce the headline per-fold Spearman ρ_nc = 0.590 ± 0.012 (Tier D) against the full corpus, run the Colab quickstart notebook (free-tier, CPU, ~3 min).
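A per-fold Spearman summary of this kind can be sketched in plain Python as follows. This is not the Colab notebook: the row layout (fold, measured z, predicted z) is hypothetical, the ± is assumed to be the standard deviation across folds, the toy numbers are made up, and filtering to the non-conserved subset is left out.

```python
from statistics import mean, stdev

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def per_fold_summary(rows):
    """rows: (fold, measured_z, predicted_z). Returns (mean rho, SD across folds)."""
    by_fold = {}
    for fold, y_true, y_pred in rows:
        xs, ys = by_fold.setdefault(fold, ([], []))
        xs.append(y_true)
        ys.append(y_pred)
    rhos = [spearman(xs, ys) for xs, ys in by_fold.values()]
    return mean(rhos), stdev(rhos)

# Toy held-out predictions for two folds.
rows = [
    (0, -1.0, -0.8), (0, 0.0, 0.1), (0, 1.0, 0.9), (0, 2.0, 1.5),
    (1, -1.0, 0.2), (1, 0.0, -0.3), (1, 1.0, 0.8), (1, 2.0, 1.1),
]
rho_mean, rho_sd = per_fold_summary(rows)
```

In practice `scipy.stats.spearmanr` computes the same statistic; the hand-rolled version only keeps the sketch dependency-free.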

Abstract

From the bioRxiv preprint, verbatim.

Generalizable protein-expression prediction can accelerate protein engineering, inform disease mechanisms, and help optimize heterologous recombinant protein production. Protein expression is governed by many interacting parameters that no single omics view captures. We develop Aiki-XP, a multimodal fusion platform integrating four biological scales (genome context, operon architecture, coding-sequence composition, protein identity) plus per-gene biophysical features across 492,026 genes from 385 bacterial species. Aiki-XP predicts within-species relative abundance (per-species z-score rank), not absolute copies per cell. Under a leakage-controlled gene-operon split Aiki-XP reaches Spearman ρ_nc=0.592 on non-conserved genes versus 0.509 for ESM-C 600M alone, and each tier of a monotone protein → operon → genome deployment ladder yields a statistically significant gain. All recipes were locked before external evaluation; transfer to heterologous, cross-species, and novel-phylum benchmarks demonstrates utility and limits. Ablation and scaling experiments identify operon-scale genomic context, not protein-language-model capacity, as the rate-limiting input for bacterial expression prediction at this scale.

Glossary of terms

ρ (Spearman rank correlation): A measure between -1 and 1 of how well the model's predicted ranking agrees with the true ranking of gene expression. 1 = perfect agreement, 0 = no correlation.
ρ_nc (non-conserved Spearman): The same metric, but computed only on genes that do not belong to conserved protein families spanning many species (e.g. ribosomal proteins). This prevents the model from getting credit for recognizing family identity.
Per-species z-score: Each gene's expression is re-expressed as how many standard deviations above (or below) the mean of its own host species. This removes systematic differences between species and between proteomics platforms.
Operon: A cluster of bacterial genes co-transcribed from a single promoter into a single mRNA. Neighboring genes in the same operon often share regulatory fate.
Tier D, a.k.a. XP5 (paper configuration): The full 5-modality Aiki-XP model. The paper calls this configuration "XP5" because it fuses 5 biological modalities (genome context via Bacformer-large, operon architecture via Evo-2 7B, CDS composition via HyenaDNA, protein identity via ESM-C, and per-gene biophysical features, where the biophysical modality aggregates codon usage, protein properties, disorder, operon structure, and RNA folding) and is evaluated under 5-fold cross-validation.
Native vs. heterologous: Native: your CDS exists in the chosen host genome, so Aiki-XP uses its real operon context. Heterologous: your CDS is foreign to the host; we wrap it in the host's lacZ promoter-CDS-terminator context, reflecting a common recombinant-expression protocol.
Gene-operon split: The leakage-control scheme used to evaluate Aiki-XP. Genes are grouped into MMseqs2 clusters and partitioned at the cluster level between train and test, so no test gene shares a protein family with any training gene.
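The cluster-level partition described in the last entry can be sketched as a grouped k-fold assignment. This is a minimal illustration, not the paper's pipeline: the cluster labels are hypothetical stand-ins for clustering output, the greedy size balancing is a choice made for this sketch, and operon membership is not modelled.

```python
from collections import defaultdict

def cluster_kfold(gene_to_cluster, n_folds=5):
    """Assign each cluster, and therefore all of its genes, to exactly one fold.

    Clusters are placed largest-first into the currently smallest fold,
    a simple greedy balancing heuristic chosen for this sketch.
    """
    members = defaultdict(list)
    for gene, cluster in gene_to_cluster.items():
        members[cluster].append(gene)
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(members, key=lambda c: (-len(members[c]), c)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[cluster])
    return folds

# Hypothetical gene-to-cluster labels.
gene_to_cluster = {
    "g1": "c1", "g2": "c1", "g3": "c1",
    "g4": "c2", "g5": "c2",
    "g6": "c3", "g7": "c4",
}
folds = cluster_kfold(gene_to_cluster, n_folds=2)

# Leakage check: no cluster may appear in more than one fold.
fold_of = {gene: i for i, fold in enumerate(folds) for gene in fold}
folds_per_cluster = defaultdict(set)
for gene, cluster in gene_to_cluster.items():
    folds_per_cluster[cluster].add(fold_of[gene])
leak_free = all(len(s) == 1 for s in folds_per_cluster.values())
```

Because whole clusters move together, each fold's test genes never share a cluster with the genes used to train that fold's model.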

Partner with us → partnerships@aikium.com

Research collaborations, enterprise deployment, custom host additions, or anything else.

Built on the work of many

Aiki-XP would not exist without the foundation models, datasets, and infrastructure released by these teams. If you're building on top of Aiki-XP, please cite them too.

Infrastructure

Modal Labs: serverless GPU hosting. Every prediction on this page runs on Modal containers; their startup-program credits made the live demo possible.
Zenodo: permanent data and model-weights archive (CC-BY 4.0).
GitHub: source code, issues, discussions (Apache 2.0).

Foundation models

ESM-C (EvolutionaryScale)
ProtT5-XL (Elnaggar et al.)
HyenaDNA (Nguyen et al.)
Evo-2 (Brixi, Hsu, Hie et al., Arc Institute)
Bacformer (Wiatrak, Weimann, Floto et al.)
ViennaRNA (Lorenz et al.)

Data

PaxDb v6.0 (Huang et al. 2025): integrated quantitative proteomics
Abele et al. 2025: bacterial proteomics atlas (Mol. Cell. Proteomics, MassIVE MSV000096603)
NCBI RefSeq: 4,566 reference bacterial genomes

Each upstream model and dataset retains its own licence. Users deploying Aiki-XP, its outputs, or derived predictions in their own workflows are responsible for complying with the respective upstream terms; follow the links above for each.
