Open problems

We have curated a set of problems that we believe are key to advancing from narrow to general virtual cells.

We think about and work on these problems often, but the opportunity space is much larger than what we can launch formal collaborations on.
We are therefore publishing them with the following commitment, so that you are free to try your hand at them even as an individual or a small group:

If you can demonstrate meaningful progress (even at small scale) on any of the problems below, we are happy to support your work with:

access to our internal, curated datasets
technical feedback and discussion from a mentor with industry experience
help validating results in real-world settings

This comes with no fees, but also no funding and no IP transfer.
Your research stays your own; we just help supercharge your findings.

What counts as progress?

A clear, measurable improvement over the baseline model an experienced computational biologist would build, in a generalization setting that maps well to how the model would be applied in the real world.
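One way to make "clear, measurable improvement" concrete is a paired comparison of per-unit errors on the held-out set. The sketch below is only an illustration under stated assumptions: the function name `paired_bootstrap_improvement`, the choice of a percentile bootstrap, and the idea of scoring one error value per held-out unit (for example per held-out drug or cell line) are ours for this example, not a required protocol.

```python
import random
from statistics import mean


def paired_bootstrap_improvement(baseline_errors, model_errors,
                                 n_resamples=2000, seed=0):
    """Mean error reduction (baseline - model) with a 95% bootstrap interval.

    Both lists hold one error value per held-out unit, in the same order,
    so the comparison is paired. Hypothetical helper for illustration.
    """
    if len(baseline_errors) != len(model_errors):
        raise ValueError("error lists must be paired and of equal length")
    diffs = [b - m for b, m in zip(baseline_errors, model_errors)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample held-out units with replacement and recompute the mean gain.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return mean(diffs), (lower, upper)


# Toy usage with made-up numbers: positive gain means the new model is better.
gain, (lower, upper) = paired_bootstrap_improvement(
    baseline_errors=[1.0, 1.2, 0.9, 1.1],
    model_errors=[0.8, 1.1, 0.9, 0.7],
)
```

An interval that stays above zero across held-out units is one reasonable reading of "clear"; other tests or metrics may suit a given problem better.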

PDX effect prediction from in vitro data

Patient-derived xenograft (PDX) models are the dominant preclinical model systems in oncology. However, the way a PDX experiment is performed is very different from how the same treatment is administered in vitro. Simply aligning the PDX and in vitro transcriptomes with current alignment methods removes not just the batch effect that comes from the difference between the experimental protocols, but also the true difference between the two microenvironments. Removing the batch effect while preserving that true difference is most likely necessary to go beyond the current state of the art.

Success is an improvement in growth rate prediction performance for PDX models in a treatment-exclusive split: all PDX models and all in vitro data can be in the training set, but the tested drug cannot be present in the PDX data in any modality. The scenario we model here is a pharma company deciding which PDX models to test with a new molecule that looks promising in vitro.
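The treatment-exclusive split described above could be sketched as follows. This is a minimal illustration, not a reference implementation: the record layout (dicts with `source`, `drug`, and `modality` keys), the label `"growth_rate"`, and the function name are assumptions made for the example.

```python
def treatment_exclusive_split(records, held_out_drugs,
                              target_modality="growth_rate"):
    """Split so held-out drugs never appear in the PDX training data.

    Each record is a dict with hypothetical keys:
      "source":   "pdx" or "in_vitro"
      "drug":     name of the administered drug
      "modality": what was measured (e.g. "growth_rate", "rnaseq")
    """
    held_out = set(held_out_drugs)
    train, test = [], []
    for rec in records:
        if rec["source"] == "in_vitro" or rec["drug"] not in held_out:
            # All in vitro data and all PDX data for other drugs is trainable.
            train.append(rec)
        elif rec["modality"] == target_modality:
            # PDX growth response to a drug never seen in PDX training data.
            test.append(rec)
        # PDX records of a held-out drug in any other modality are dropped,
        # so the drug is absent from PDX training data in every modality.
    return train, test


# Toy usage with made-up records.
records = [
    {"source": "in_vitro", "drug": "A", "modality": "viability"},
    {"source": "pdx", "drug": "A", "modality": "growth_rate"},
    {"source": "pdx", "drug": "A", "modality": "rnaseq"},
    {"source": "pdx", "drug": "B", "modality": "growth_rate"},
]
train, test = treatment_exclusive_split(records, held_out_drugs={"A"})
```

Note that the same PDX model may appear on both sides of the split; only the drug is exclusive.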

Multimodal synergy

The more data modalities, the better - one would say. Clearly, having a better understanding of the cell state should give us a more precise understanding of the hidden biological context in which the experiments have been performed.

In most cases, what works is filling in the blanks: predicting the missing modalities for samples where at least one measurement has been made. While this can be useful in some circumstances, the real target is to use multimodal data to improve prediction performance beyond where any measurements were made. Given the plethora of possible modalities, there are surely many modality combinations that are synergistic in this sense but that no one has ever tried training together.

Success is a measurable improvement in in vitro cell-line exclusive (CEX) or perturbation-exclusive (PEX) phenotype prediction performance of the model compared to what the best single-modality model could do.
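The CEX and PEX settings can both be read as group-wise splits: whole cell lines (or whole perturbations) are held out, so nothing in the test set shares that key with the training set. A minimal sketch, assuming records are dicts with hypothetical `cell_line` and `perturbation` keys; the function name and the default held-out fraction are illustrative choices.

```python
import random


def exclusive_split(records, key, test_fraction=0.2, seed=0):
    """Hold out whole groups so no test group is ever seen in training.

    key="cell_line" gives a CEX-style split,
    key="perturbation" gives a PEX-style split.
    """
    groups = sorted({rec[key] for rec in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Always hold out at least one group.
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])
    train = [rec for rec in records if rec[key] not in test_groups]
    test = [rec for rec in records if rec[key] in test_groups]
    return train, test


# Toy usage: 5 cell lines x 3 perturbations, split by cell line (CEX).
records = [
    {"cell_line": f"line_{i}", "perturbation": f"drug_{j}"}
    for i in range(5)
    for j in range(3)
]
cex_train, cex_test = exclusive_split(records, key="cell_line")
pex_train, pex_test = exclusive_split(records, key="perturbation")
```

Splitting at the group level rather than the row level is what makes the evaluation a generalization test instead of an interpolation test.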

Better pretraining

Pretraining, i.e. using native data to teach the model general rules of biology, is a good idea - but it doesn't yet work. That is, no amount of native data has ever been shown to significantly improve perturbation prediction performance. That could be due to at least two reasons: either native single-cell RNASeq is not the right data to use for pretraining, or the pretraining task itself is too easy.

We're happy to support any approach that can show an improvement on in vitro cell-line exclusive (CEX) or perturbation-exclusive (PEX) phenotype prediction performance using any kind of native data.

RNASeq to phenotype prediction

In an ideal world, abundant public RNASeq data could serve as the glue connecting many different downstream assays, enabling all kinds of applications to learn from each other. In practice, the added benefit from public Drug-Seq or Perturb-Seq data is small.

We are happy to support anyone who can demonstrate a robust performance improvement on zero-shot (cell- or treatment-exclusive) phenotype prediction from unrelated RNASeq data.

Biological flows

Over short time horizons, the biological landscape of a cell (as defined by its DNA) can be assumed to be time invariant. That is a strong regularization opportunity, as yet unexplored. Could we improve perturbation prediction performance with an iterative architecture? Could such a system connect measurement data taken over different time horizons?

As usual, success is a measurable improvement in in vitro cell-line exclusive (CEX) or perturbation-exclusive (PEX) phenotype prediction performance of the model compared to what a non-iterative model would do given the same data.
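To illustrate what "iterative" could mean here, the sketch below applies one shared, time-invariant update function repeatedly, so measurements at different time horizons correspond to different numbers of steps of the same function. Everything in it is an assumption for illustration: the Euler-style update, the names `rollout` and `toy_step`, and the toy relaxation dynamics stand in for whatever learned model one would actually use.

```python
def rollout(state, perturbation, step_fn, n_steps, dt=1.0):
    """Apply the same time-invariant update n_steps times (Euler steps).

    Returns the full trajectory, including the initial state.
    """
    trajectory = [list(state)]
    for _ in range(n_steps):
        current = trajectory[-1]
        delta = step_fn(current, perturbation)
        trajectory.append([x + dt * d for x, d in zip(current, delta)])
    return trajectory


def toy_step(state, perturbation):
    # Hypothetical dynamics: each feature relaxes toward a target level
    # set by the perturbation. A learned network would replace this.
    return [0.5 * (target - x) for x, target in zip(state, perturbation)]


# The same step function serves a short and a long time horizon.
short_horizon = rollout([0.0, 1.0], [1.0, 1.0], toy_step, n_steps=2)
long_horizon = rollout([0.0, 1.0], [1.0, 1.0], toy_step, n_steps=6)
```

Because the step function does not depend on time, a short-horizon and a long-horizon measurement both constrain the same parameters, which is the regularization effect the section points to.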