First Thoughts on Artificial Intelligence and Genetic Engineering
Here are some of my first impressions of the race to use artificial intelligence for genetic engineering. I am writing this to better understand where artificial intelligence currently fits within biological research and where it may take us in the future. These are not meant to be definitive conclusions. They are simply my thoughts as I try to make sense of the current state of the field.
AlphaFold
Of the many developments in artificial intelligence, I think AlphaFold has been one of the most impactful for biology. This protein structure prediction program has dramatically shortened the time required to obtain a structural model of a protein. Before tools like AlphaFold, determining a protein’s structure often required months or years of experimental work using techniques such as X-ray crystallography or cryo-electron microscopy. Today, researchers can often begin with a computational prediction, in many cases a highly accurate one, in a matter of hours.
This does not mean that biology has suddenly become easy. Proteins still need to be synthesized, characterized, and experimentally validated. However, AlphaFold has significantly reduced one of the major bottlenecks in the research process. Instead of spending years simply trying to determine what a protein looks like, scientists can spend more of their time asking what that protein actually does and how it can be engineered or manipulated.
To me, AlphaFold represents one of the first major examples of artificial intelligence fundamentally changing how biological research is performed. Rather than replacing scientists, it has allowed researchers to spend less time solving one difficult problem and more time asking deeper scientific questions.
The Data
One thing that I think is absolutely essential to the machine learning revolution is the generation and acquisition of high-quality data. Machine learning models learn from reality, or at least from our best representation of it. If we can capture high-fidelity observations of biological systems, then those observations become the foundation upon which these models can learn.
For me, it always comes back to the idea of garbage in, garbage out. If a model is trained on poor-quality or incomplete data, then we should not be surprised when it produces poor-quality predictions. No amount of compute or clever model architecture can completely overcome bad data.
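As a toy illustration of that point, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the "true" relationship is a made-up straight line, the "garbage" dataset is simply the clean one with roughly 30% of the measurements replaced by random values, and names such as `true_signal` and `fit_line` are hypothetical. It is not meant to represent any real biological dataset or modeling pipeline.

```python
import random


def true_signal(x):
    # Hypothetical ground truth standing in for a real biological relationship.
    return 2.0 * x + 1.0


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def error_against_truth(model, xs):
    """Mean squared difference between the fitted line and the true signal."""
    slope, intercept = model
    return sum((slope * x + intercept - true_signal(x)) ** 2 for x in xs) / len(xs)


rng = random.Random(0)  # fixed seed so the example is repeatable
xs = [i / 10 for i in range(100)]

# "Clean" measurements: exact values of the assumed true signal.
clean_ys = [true_signal(x) for x in xs]

# "Garbage" measurements: roughly 30% are replaced by values unrelated to x.
garbage_ys = [rng.uniform(-20, 20) if rng.random() < 0.3 else y for y in clean_ys]

clean_error = error_against_truth(fit_line(xs, clean_ys), xs)
garbage_error = error_against_truth(fit_line(xs, garbage_ys), xs)

print(f"error when trained on clean data:   {clean_error:.6f}")
print(f"error when trained on garbage data: {garbage_error:.6f}")
```

The same fitting procedure is used in both cases. Only the quality of the training data changes, and the model trained on corrupted measurements ends up further from the true signal.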
This is one of the reasons why I think experimental researchers will continue to play such a critical role. Before a model can learn anything meaningful, someone has to perform the experiments, develop new assays, collect measurements, and carefully validate the results. Biology does not simply produce datasets on its own. Those datasets are the result of years of work by scientists trying to understand highly specialized biological systems.
What I continue to wonder, however, is whether we can eventually build models that are much more informed by the underlying physics of biological systems. Rather than simply memorizing patterns from enormous datasets, could they learn the physical principles that govern how proteins, cells, and entire organisms behave? We may never be able to perfectly simulate every biological process (after all, we do not know what we do not know), but perhaps we can build models that better capture the rules we do understand.
If we can continue improving both our experiments and our models, then each can strengthen the other. Better experiments produce better data. Better data produces better models. Better models suggest better experiments. Over time, that feedback loop may be what truly accelerates biological discovery, allowing us to observe reality more clearly and solve problems that today still seem out of reach.
Looking Forward
I know that physics-based models are actively being developed, and that is something I will have to spend more time learning about as I continue down this path. At the moment, however, my plate is already full: I am focused on learning how to take protein data and molecular dynamics simulations and incorporate them into artificial intelligence workflows. There is already plenty to learn there before I start venturing further into physics-informed machine learning.
Well, that is it for this particular check-in. I think my next step will be to start building a knowledge graph so that I can identify the gaps in my understanding and map out the best way to fill them.
Ultimately, that is one of the purposes of this blog. It is a place for me to write down my thoughts, document my understanding as it evolves, and keep myself intellectually honest. I hope that, over time, these posts will not only show what I have learned but also capture the process of becoming a better researcher.