
đŸ‘€ Problem

The current process of producing patient reports from genome sequencing results relies on a simple comparison between the patient's genome variants and standard databases such as ClinVar and ClinGen. This approach has two disadvantages:

  1. It ignores non-coding regions, even though many genetic diseases are caused by mutations in the non-coding regions of the genome.
  2. Manually comparing a patient's genome against these databases is both sub-optimal and laborious.
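
For concreteness, the comparison step amounts to a lookup of each patient variant in a curated table. The sketch below illustrates this under simplifying assumptions: a tab-separated ClinVar-style export whose file name and column names (`chrom`, `pos`, `ref`, `alt`, `clinical_significance`) are hypothetical, not a real ClinVar API.

```python
# Minimal sketch of the current reporting workflow, assuming a simplified
# tab-separated ClinVar-style export; the file name and column names are
# hypothetical, not a real ClinVar interface.
import csv

def load_clinvar_table(path):
    """Index a variant table by (chrom, pos, ref, alt)."""
    table = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            key = (row["chrom"], int(row["pos"]), row["ref"], row["alt"])
            table[key] = row["clinical_significance"]
    return table

def annotate_variants(patient_variants, clinvar):
    """Look up each patient variant; anything absent from the table,
    including most non-coding variants, falls through as None."""
    return [(v, clinvar.get(v)) for v in patient_variants]

clinvar = load_clinvar_table("clinvar_subset.tsv")  # hypothetical file
report = annotate_variants([("17", 43044295, "A", "G")], clinvar)
```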

DNA comprises a sequence of four bases: adenine (A), thymine (T), cytosine (C), and guanine (G). The ability to generate specific DNA sequences paves the way for myriad applications, notably the creation of functional biotherapeutics such as vaccines for diseases like COVID-19 and cancer.

On the other hand, decoder-only transformer models have demonstrated their power in modeling a variety of sequences, from text to proteins. Generally speaking, these uni-modal models for different data types can be classified as language models (LMs). While LMs for text can have their generation conditioned via prompts, the same is not straightforward for non-textual sequences.
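
To make the contrast concrete, the sketch below conditions a textual LM with a natural-language prompt, using GPT-2 via the Hugging Face `transformers` library purely as a stand-in; a DNA LM has no equivalent prompt vocabulary, which is exactly the gap this project addresses.

```python
# Sketch of prompt conditioning for a decoder-only text LM (GPT-2 as a
# stand-in). The prompt itself steers generation; a DNA LM has no such
# natural-language prompt.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("A vaccine encodes", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```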

The core focus of this research project is to tap into the potential of LMs for DNA sequence engineering, serving as a bridge between textual and biological sequences. Specifically, our objectives are:

  1. Developing methods to condition LMs for DNA sequences, allowing for controlled and targeted DNA sequence generation (one possible approach is sketched after this list).
  2. Refining and fine-tuning existing LMs to make them suitable for DNA sequences.
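
The sketch below illustrates one possible conditioning strategy for objective 1: prepending a control tag to each training sequence so that a decoder-only LM learns tag-to-sequence mappings. The tag names, special tokens, and corpus are invented for illustration and do not reflect a settled design.

```python
# Hypothetical sketch of tag-based conditioning: each training example pairs
# an invented control tag with its target DNA sequence, so a decoder-only LM
# can later be seeded with the tag alone and asked to continue with bases.
def make_conditioned_example(tag: str, dna: str) -> str:
    """Format one training example as <tag> + sequence + end-of-sequence."""
    return f"<{tag}>{dna}<eos>"

# Training corpus of (function tag, sequence) pairs -- entirely made up here.
corpus = [
    ("promoter", "TATAAAAGGCGCGT"),
    ("enhancer", "GGCCAATCTGGCAA"),
]
train_texts = [make_conditioned_example(t, s) for t, s in corpus]

# At generation time the model would be prompted with "<promoter>" and
# decoded autoregressively to produce a candidate promoter sequence.
```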

Furthermore, as an extended goal, we will investigate fundamental LM pretraining aspects, including the tokenisation of input sequences.
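
As one candidate tokenisation scheme, the sketch below shows overlapping k-mer tokenisation, used for example by DNABERT with 6-mers; learned subword vocabularies such as BPE are an alternative we would compare against.

```python
# Minimal sketch of overlapping k-mer tokenisation for DNA strings over
# the alphabet {A, C, G, T}; k and stride are hyperparameters of the
# pretraining study.
def kmer_tokenise(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA string into overlapping k-mers."""
    return [sequence[i : i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmer_tokenise("ATGCGTACGT", k=6))
# ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG', 'GTACGT']
```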
