Click one of the three example buttons to pre-fill the form with a different input style. Each represents a common user scenario.
Example by input type:
This form runs live inference. The paper's 244,002 held-out 5-fold CV predictions for every gene in the training corpus are on Zenodo and reproducible via the Colab quickstart.
Host proteome calibration
Every blue (non-conserved) and gold (conserved) dot is a test-split gene in this host.
The x-axis shows its PaxDb/Abele measured abundance (per-species z-score);
the y-axis shows Aiki-XP's held-out Tier D prediction for that gene.
A tight cloud along the diagonal means Aiki-XP is well-calibrated for this host.
Your submitted gene will appear as a bright gold marker once the prediction completes.
Host · Test genes · ρ (overall) · ρ (non-mega)
Your prediction
The triangle marks your gene on the host's per-species expression distribution. Genes to the right are predicted high expressers for that host.
Tier · Host · Mode · Operon · Operon length (nt)
Gene
Predicted z-score
Host percentile
Interpretation
Dissect this prediction: confidence, biophysics, outlier flags
Reliability
Biophysical snapshot
In-distribution check
What drove this prediction
Leave-one-out ablation of each Tier D modality. Bar length = change in predicted z-score if that modality is zero-filled. Positive bars mean the modality contributed positively to the final prediction.
Everything here is computed from data already in the response or from your protein/CDS string. Modality attribution uses additional GPU calls (auto-computed on Tier D runs).
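The leave-one-out ablation described above can be sketched in a few lines. This is a minimal illustration, not the deployed code: `toy_fusion_head`, its weights, and the modality names used as dictionary keys are hypothetical stand-ins for the real fusion head and its inputs. The sign convention (full prediction minus ablated prediction) is chosen so that a positive bar means the modality pushed the prediction up, matching the description above.

```python
from typing import Callable, Dict, List

Features = Dict[str, List[float]]


def toy_fusion_head(features: Features) -> float:
    """Hypothetical stand-in for the fusion head: weighted sum of per-modality means."""
    weights = {"genome": 0.5, "operon": 0.8, "cds": 0.2,
               "protein": 0.4, "biophysical": -0.3}
    return sum(weights[name] * sum(vec) / len(vec)
               for name, vec in features.items())


def leave_one_out_attribution(predict: Callable[[Features], float],
                              features: Features) -> Dict[str, float]:
    """Return (full - ablated) predicted z-score for each modality, zero-filled in turn."""
    full = predict(features)
    attribution = {}
    for name, vec in features.items():
        ablated = dict(features)            # shallow copy; other modalities untouched
        ablated[name] = [0.0] * len(vec)    # zero-fill the modality under test
        attribution[name] = full - predict(ablated)
    return attribution


# Toy input: every modality is a two-element vector of ones.
example = {name: [1.0, 1.0]
           for name in ("genome", "operon", "cds", "protein", "biophysical")}
bars = leave_one_out_attribution(toy_fusion_head, example)
for name, delta in sorted(bars.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:12s} {delta:+.2f}")
```

Each modality costs one extra forward pass, which is consistent with the note above that attribution uses additional GPU calls.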
Similar proteins in the 492K corpus
The five nearest neighbours by ESM-C embedding cosine similarity. Each row shows the paper's measured truth and held-out Tier D prediction: a direct trust indicator for how the model behaves on sequence-space neighbours of your gene.
Neighbour
Species
Similarity
PaxDb truth
Tier D CV
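The neighbour lookup behind this table can be illustrated with a brute-force cosine-similarity search. This is only a sketch: the three-dimensional vectors and gene identifiers below are invented for the example, real ESM-C embeddings are far wider, and a corpus of this size would presumably be served from an approximate-nearest-neighbour index rather than a linear scan.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbours(query: List[float],
                       corpus: Dict[str, List[float]],
                       k: int = 5) -> List[Tuple[str, float]]:
    """Return the k corpus entries most similar to the query, best first."""
    scored = [(gene_id, cosine_similarity(query, emb))
              for gene_id, emb in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy corpus; identifiers and vectors are illustrative only.
toy_corpus = {
    "gene_a": [1.0, 0.0, 0.0],
    "gene_b": [0.0, 1.0, 0.0],
    "gene_c": [1.0, 1.0, 0.0],
    "gene_d": [-1.0, 0.0, 0.0],
}
top2 = nearest_neighbours([1.0, 0.0, 0.0], toy_corpus, k=2)
print(top2)
```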
The tier ladder for this gene
Each tier adds a modality. The gap from Tier A to Tier D is how much operon + genome context is worth on this sequence.
Tier
Modalities
Predicted z
Δ vs Tier A
Latency
Predicted 3D structure
ESMFold v1 (Meta AI) structure prediction for your protein, served via the public ESMFold Atlas API.
Colouring by per-residue pLDDT confidence.
● very high (≥90)
● confident (70–90)
● low (50–70)
● very low (<50)
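The legend above maps directly onto a small lookup function. In this sketch the handling of the exact boundaries (a score of exactly 70 or 50 goes to the higher band) is an assumption, since the legend lists the ranges without saying which side is inclusive.

```python
def plddt_band(plddt: float) -> str:
    """Map a per-residue pLDDT score (0-100) to the confidence band in the legend."""
    if not 0.0 <= plddt <= 100.0:
        raise ValueError(f"pLDDT must be in [0, 100], got {plddt}")
    if plddt >= 90:
        return "very high"
    if plddt >= 70:   # assumption: boundary value belongs to the higher band
        return "confident"
    if plddt >= 50:   # assumption: boundary value belongs to the higher band
        return "low"
    return "very low"


# Toy per-residue scores, for illustration only.
bands = [plddt_band(score) for score in (95.2, 81.0, 63.4, 31.9)]
print(bands)
```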
Where your protein sits in the XP5 embedding landscape
40,000 corpus genes plotted in the paper's fused UMAP space (Fig. 3 /
si_umap_anatomy). Points coloured by held-out Tier D
prediction (gold = high expresser, blue = low, grey = training split).
Your protein's position is inferred from its 10 nearest ESM-C neighbours
and shown as a gold star (approximate projection, not an exact UMAP).
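One plausible way to place a new protein from its neighbours is a similarity-weighted centroid of the neighbours' existing UMAP coordinates. The page does not state how the position is actually derived from the 10 neighbours, so the weighting scheme below is an assumption made for illustration, and `project_from_neighbours` is a hypothetical name.

```python
from typing import List, Tuple

# (cosine similarity to the query, (umap_x, umap_y) of the neighbour)
Neighbour = Tuple[float, Tuple[float, float]]


def project_from_neighbours(neighbours: List[Neighbour]) -> Tuple[float, float]:
    """Approximate 2-D position as the similarity-weighted mean of neighbour coordinates."""
    weights = [max(sim, 0.0) for sim, _ in neighbours]  # ignore negative similarities
    total = sum(weights)
    if total == 0.0:
        raise ValueError("need at least one neighbour with positive similarity")
    x = sum(w * xy[0] for w, (_, xy) in zip(weights, neighbours)) / total
    y = sum(w * xy[1] for w, (_, xy) in zip(weights, neighbours)) / total
    return x, y


# Two equally similar toy neighbours: the star lands midway between them.
star = project_from_neighbours([(1.0, (0.0, 0.0)), (1.0, (2.0, 4.0))])
print(star)
```

Because the result is an average of existing points, it can never fall outside the convex hull of the neighbours, which is one reason to treat the star as an approximate projection.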
Anyone opening the link reproduces this exact prediction.
492,026 genes across 385 bacterial species
ρnc = 0.592 on non-conserved genes (gene-operon holdout, 5-fold CV)
5 deployment tiers (protein → operon → genome)
1,831 host genomes available as operon context
How Aiki-XP works
Three steps from a protein sequence to a ranked prediction.
1. Submit a protein + its CDS
Paste the amino-acid sequence and the coding-DNA sequence that produces it. Pick the bacterial host you plan to express in (any of 1,831 reference genomes).
2. Extract five modalities
Aiki-XP places your CDS in its genomic neighborhood (native gene, or wrapped in the host's lacZ context for heterologous expression), then extracts five biological modalities: genome context (Bacformer-large), operon architecture (Evo-2 7B), CDS composition (HyenaDNA), protein identity (ESM-C), and per-gene biophysical features.
3. Rank within the proteome
A 25M-parameter fusion head returns a per-species z-score: where your gene sits on the host's expression distribution. A z-score above the host median suggests a favorable expression candidate; a low z-score suggests an unfavorable one.
What this predicts. Aiki-XP ranks expression candidates within a host's proteome.
It reports whether your gene is a high, medium, or low expresser relative to other genes in the
selected host. It does not predict absolute protein yield in µg/mL. For choosing among
multiple candidates in a given host, rank is the right signal; for absolute yield, wet-lab
screening remains necessary.
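To make the rank-not-yield point concrete, the sketch below turns a predicted z-score into an empirical host percentile and orders a small design library by predicted z. The host distribution, candidate names, and scores are invented, and the percentile definition (share of host genes at or below the query) is an assumption about how such a number could be computed, not a description of the deployed code.

```python
from bisect import bisect_right
from typing import Dict, List


def host_percentile(z: float, host_z_scores: List[float]) -> float:
    """Percentage of host genes whose z-score is at or below z."""
    if not host_z_scores:
        raise ValueError("host distribution is empty")
    ordered = sorted(host_z_scores)
    return 100.0 * bisect_right(ordered, z) / len(ordered)


def rank_candidates(predicted_z: Dict[str, float]) -> List[str]:
    """Order candidate names from highest to lowest predicted z-score."""
    return sorted(predicted_z, key=predicted_z.get, reverse=True)


# Toy host distribution and hypothetical candidate predictions.
host = [-2.0, -1.0, 0.0, 1.0, 2.0]
library = {"variant_1": 0.3, "variant_2": 1.2, "variant_3": -0.4}

pct = host_percentile(0.5, host)
order = rank_candidates(library)
print(pct, order)
```

Only the ordering of `order` carries meaning here; the gap between two z-scores says nothing about a fold difference in yield.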
Where Aiki-XP works, and where it doesn't
Known limits of this release.
✓ Works well for
Ranking native bacterial proteins within their own host genome (Spearman ρnc=0.59).
Heterologous candidates where the target host is one of our 1,831 cached bacterial genomes.
Sequences in the same length range as the training set (~50-2000 aa); works best at ~150-800 aa.
Comparisons among candidates: absolute z is noisy, but rank order within a design library is the signal the paper evaluates.
⚠ Use with care for
Hosts outside the cached 1,831 genomes. Pick the closest phylogenetic neighbour, or submit the NCBI accession and the typeahead will add it.
Point mutants of a known protein. The tier ladder is designed for rank-order over diverse sequences, not fine-grained effect sizes of single substitutions.
Proteins with unusual features (membrane-spanning regions, disulfide-heavy, heavily glycosylated in the native context). Training coverage is thin for these classes.
Reading individual z-scores as calibrated absolute numbers. Expect MAE ≈ 0.47 per non-conserved gene, with 95% of predictions within |Δ| < 1.5.
× Will not work for
Eukaryotic hosts (yeast, CHO, plant, mammalian). This model is pan-bacterial only; eukaryotic hosts are out of distribution and would require retraining on eukaryotic corpora.
Absolute expression prediction in µg/mL. The model predicts per-species rank, not yield; use wet-lab measurement for absolute values.
Tagged sequences (His6/His10, HiBit, FLAG, GST, SUMO) scored with the tag intact. Strip tags first (aikixp.sequence_normalization). The model is trained on untagged native proteins.
Cell-free or synthetic expression systems. Aiki-XP is trained on in-vivo PaxDb and Abele atlases; reactions in the PURE system or yeast-lysate TXTL are out of distribution.
Toxic proteins, or ones the host cannot fold. The model is a rank-order predictor of expression potential; it does not account for post-translational degradation, aggregation, or host death.
For the full failure-mode analysis, including per-domain calibration and the external-validation benchmarks where we explicitly lose to specialized tools, see §4 (Limitations) of the paper.
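The tag-stripping advice in the list above can be illustrated for the simplest case, a terminal poly-histidine tag. This sketch is not the `aikixp.sequence_normalization` implementation, whose interface is not described here; `strip_his_tag` is a hypothetical helper that only removes a run of six or more histidines at either end of the protein and leaves linkers, protease sites, and every other tag type untouched.

```python
import re

# A run of >= 6 His at the N-terminus, optionally preceded by the start Met.
_N_TERM_HIS = re.compile(r"^(M?)H{6,}")
# A run of >= 6 His at the C-terminus.
_C_TERM_HIS = re.compile(r"H{6,}$")


def strip_his_tag(protein: str) -> str:
    """Remove a terminal poly-His tag, keeping the initial methionine if present."""
    seq = protein.strip().upper()
    seq = _N_TERM_HIS.sub(r"\1", seq)   # keep the captured Met, drop the His run
    seq = _C_TERM_HIS.sub("", seq)
    return seq


# Invented toy sequences, for illustration only.
n_tagged = strip_his_tag("MHHHHHHSSGLVPR")
c_tagged = strip_his_tag("MKTAYHHHHHH")
untagged = strip_his_tag("MKTAY")
print(n_tagged, c_tagged, untagged)
```

If the protein is trimmed this way, the submitted CDS needs the matching codons removed as well so that the two sequences stay in register.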
Compute footprint. Warm per-request latency: Tier A 15 s (CPU), Tier D 5 to 10 s (A100).
Cold start: 30 to 120 s after ~5 min idle. Approximate cost: $0.001 per request for Tier A, $0.004 to $0.006 for Tier D.
Paper training totals: XP5 360 A100-h (5 folds × 72 h) plus embedding extraction ~1000 A100-h across Evo-2, Bacformer, ESM-C, ProtT5, and HyenaDNA. Full traces on Zenodo.
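The figures above are enough for a back-of-the-envelope budget. The sketch below only restates the quoted per-request costs and the 5 folds × 72 h training arithmetic; the function name and the 1,000-candidate library size are illustrative assumptions.

```python
from typing import Dict, Tuple

# (low, high) approximate cost per request in USD, as quoted above.
TIER_COST_USD: Dict[str, Tuple[float, float]] = {
    "A": (0.001, 0.001),
    "D": (0.004, 0.006),
}

# XP5 training total quoted above: 5 folds at 72 A100-hours each.
XP5_TRAINING_A100_HOURS = 5 * 72


def screening_cost(n_requests: int, tier: str) -> Tuple[float, float]:
    """Return the (low, high) approximate cost in USD for n_requests at a tier."""
    low, high = TIER_COST_USD[tier]
    return n_requests * low, n_requests * high


# Hypothetical design library of 1,000 candidates scored at Tier D.
low, high = screening_cost(1000, "D")
print(f"Tier D, 1000 requests: ${low:.2f} to ${high:.2f}")
print(f"XP5 training: {XP5_TRAINING_A100_HOURS} A100-h")
```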
Reproducibility. The paper's 244,002 held-out 5-fold CV predictions are on Zenodo (DOI 10.5281/zenodo.19639621). To reproduce the headline per-fold Spearman ρ_nm = 0.590 ± 0.012 (Tier D) against the full corpus, run the Colab quickstart notebook (free-tier, CPU, ~3 min).
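A per-fold "mean ± spread" Spearman figure of the kind quoted above can be computed as in the following standard-library sketch. It is not the notebook's code: the fold data are toy values, and the use of the sample standard deviation for the "±" term is an assumption, since the text does not say which spread statistic the paper reports.

```python
import statistics
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("Spearman rho is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def per_fold_summary(folds: List[Tuple[Sequence[float], Sequence[float]]]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-fold Spearman rho."""
    rhos = [spearman_rho(truth, pred) for truth, pred in folds]
    return statistics.fmean(rhos), statistics.stdev(rhos)


# Toy folds of (measured z, predicted z); values are illustrative only.
toy_folds = [
    ([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]),
    ([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]),
]
mean_rho, sd_rho = per_fold_summary(toy_folds)
print(f"rho = {mean_rho:.3f} +/- {sd_rho:.3f}")
```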
Abstract
From the bioRxiv preprint, verbatim.
Generalizable protein-expression prediction can accelerate protein engineering, inform disease mechanisms, and help optimize heterologous recombinant protein production. Protein expression is governed by many interacting parameters that no single omics view captures. We develop Aiki-XP, a multimodal fusion platform integrating four biological scales (genome context, operon architecture, coding-sequence composition, protein identity) plus per-gene biophysical features across 492,026 genes from 385 bacterial species. Aiki-XP predicts within-species relative abundance (per-species z-score rank), not absolute copies per cell. Under a leakage-controlled gene-operon split Aiki-XP reaches Spearman ρnc=0.592 on non-conserved genes versus 0.509 for ESM-C 600M alone, and each tier of a monotone protein → operon → genome deployment ladder yields a statistically significant gain. All recipes were locked before external evaluation; transfer to heterologous, cross-species, and novel-phylum benchmarks demonstrates utility and limits. Ablation and scaling experiments identify operon-scale genomic context, not protein-language-model capacity, as the rate-limiting input for bacterial expression prediction at this scale.
Glossary of terms
ρ (Spearman rank correlation)
A measure between -1 and 1 of how well the model's predicted ranking agrees with the true ranking of gene expression. 1 = perfect agreement, 0 = no correlation.
ρnc (non-conserved Spearman)
The same metric, but computed only on genes that do not belong to conserved protein families spanning many species (e.g. ribosomal proteins). This prevents the model from getting credit for recognizing family identity.
Per-species z-score
Each gene's expression is re-expressed as how many standard deviations above (or below) the mean of its own host species. This removes systematic differences between species and between proteomics platforms.
Operon
A cluster of bacterial genes co-transcribed from a single promoter into a single mRNA. Neighboring genes in the same operon often share regulatory fate.
Tier D, a.k.a. XP5 (paper configuration)
The full 5-modality Aiki-XP model. The paper calls this configuration "XP5" because it fuses 5 biological modalities (genome context via Bacformer-large, operon architecture via Evo-2 7B, CDS composition via HyenaDNA, protein identity via ESM-C, and per-gene biophysical features, where the biophysical modality aggregates codon usage, protein properties, disorder, operon structure, and RNA folding) and is evaluated under 5-fold cross-validation.
Native vs. heterologous
Native: your CDS exists in the chosen host genome, so Aiki-XP uses its real operon context. Heterologous: your CDS is foreign to the host; we wrap it in the host's lacZ promoter-CDS-terminator context, reflecting a common recombinant-expression protocol.
Gene-operon split
The leakage-control scheme used to evaluate Aiki-XP. Genes are grouped into MMseqs2 clusters and partitioned at the cluster level between train and test, so no test gene shares a protein family with any training gene.
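The cluster-level partitioning in the last glossary entry can be sketched as follows. The gene and cluster identifiers are invented, the round-robin assignment is an illustrative choice rather than the paper's procedure, and the sketch covers only the MMseqs2-cluster grouping described in the glossary, not any additional operon-level grouping the split's name may imply.

```python
from typing import Dict


def cluster_level_folds(gene_to_cluster: Dict[str, str],
                        n_folds: int = 5) -> Dict[str, int]:
    """Assign every gene to a fold so that each cluster lands in exactly one fold."""
    clusters = sorted(set(gene_to_cluster.values()))
    # Round-robin over sorted cluster IDs: deterministic, and never splits a cluster.
    cluster_fold = {cluster: i % n_folds for i, cluster in enumerate(clusters)}
    return {gene: cluster_fold[cluster]
            for gene, cluster in gene_to_cluster.items()}


# Hypothetical toy mapping from gene ID to sequence-cluster ID.
toy_clusters = {
    "gene_1": "cluster_a",
    "gene_2": "cluster_a",
    "gene_3": "cluster_b",
    "gene_4": "cluster_c",
    "gene_5": "cluster_c",
}
folds = cluster_level_folds(toy_clusters, n_folds=2)
print(folds)
```

Holding out one fold at a time then guarantees that no held-out gene shares a cluster with any training gene, which is the property the glossary entry describes.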
Research collaborations, enterprise deployment, custom host additions, or anything else.
Built on the work of many
Aiki-XP would not exist without the foundation models, datasets, and infrastructure released by these teams. If you're building on top of Aiki-XP, please cite them too.
Infrastructure
Modal Labs: serverless GPU hosting. Every prediction on this page runs on Modal containers; their startup-program credits made the live demo possible.
Zenodo: permanent data and model-weights archive (CC-BY 4.0).
Each upstream model and dataset retains its own licence. Users deploying Aiki-XP, its outputs, or derived predictions in their own workflows are responsible for complying with the respective upstream terms; follow the links above for each.