Reading the language of cells

Although single-cell analysis has dramatically advanced medical research, it also produces massive amounts of data that are difficult to process and interpret.

To address this, researchers from Yale Engineering, Yale School of Medicine, and Google’s DeepMind have developed Cell2Sentence (C2S), a system that transforms complex multi-omic data into a structured text format that large language models (LLMs) can process and interpret.

“We converted genomic data, single-cell expression data, bulk RNA-seq data, and other modalities into a textual format,” said David van Dijk, assistant professor of medicine and of computer science. “Then we fine-tuned existing LLMs—originally trained on natural language—to understand biological ‘language.’”
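To make the idea of turning expression data into text concrete, here is a minimal Python sketch. It assumes a rank-ordering encoding, in which a cell's expressed genes are written out from most to least expressed; the article does not spell out the exact encoding C2S uses, so treat this as an illustration rather than the team's implementation. The function name `cell_to_sentence`, the `top_k` parameter, and the toy counts are all hypothetical.

```python
def cell_to_sentence(expression, top_k=None):
    """Turn one cell's gene-expression counts into a space-separated "sentence".

    expression: dict mapping gene name -> expression count (assumed input format).
    Genes with zero expression are dropped. The rest are ordered from highest
    to lowest expression, with ties broken alphabetically so the output is
    deterministic.
    """
    expressed = [(gene, count) for gene, count in expression.items() if count > 0]
    ranked = sorted(expressed, key=lambda item: (-item[1], item[0]))
    genes = [gene for gene, _ in ranked]
    if top_k is not None:
        genes = genes[:top_k]  # keep only the most highly expressed genes
    return " ".join(genes)


# Toy cell with made-up counts (not real data).
toy_cell = {"CD3E": 12, "MS4A1": 0, "GAPDH": 40, "IL7R": 12, "ACTB": 35}
sentence = cell_to_sentence(toy_cell, top_k=4)
print(sentence)
```

A string like this can be placed in a prompt alongside ordinary natural-language text, which is what would let a model trained on language read it at all.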

Van Dijk led a multidisciplinary team that includes graduate students and postdoctoral associates in computer science, medicine, and statistics and data science. A key to C2S’s success is the parallel between human language and biology. 

“There’s a universal ‘grammar’ underlying both systems—it’s all based on logic and rules,” he explained. “In language, it’s syntax and grammar; in biology, it’s gene regulatory networks. We saw that language-based models perform far better when trained this way.”

For example, researchers can feed an experimental dataset into C2S and ask: “Explain this data: What cell types are present? Which disease condition does it suggest?”

C2S will then generate a clear, natural-language summary. “We can train these models for many tasks,” van Dijk noted. “Our approach lets us combine biological data—gene expression, regulatory information—and clinical context.”

Today, van Dijk envisions C2S as a research tool for academia and the pharmaceutical industry.

“It’s about drug discovery and dissecting disease mechanisms,” he said. “We can ask, ‘What happens if we apply this drug?’ or ‘What if we knock out this gene?’”

Most of this work happens in silico, that is, as computer simulations, and promising results are later validated in the lab. Looking further ahead, C2S could underpin a true “digital twin”: a complete, patient-specific biological simulation.

“You could simulate various treatments, predict outcomes and then choose the therapy likely to work best,” he said.

Published May 20, 2025