Solving a 30 million piece exome jigsaw puzzle

One of the four overarching remits of the TGMI is to examine how clinical testing processes are best configured to maximise utility, accuracy, scalability and affordability of genetic medicine.

We are then using this knowledge to develop and validate flexible tools, processes and pipelines that maximise the research and clinical utilities of genetic testing.

Today we are announcing the first of these pipelines – OpEx. OpEx is an easy-to-use, press-and-play, freely-available genetic data analysis pipeline developed and validated for the clinical setting.


Testing all genes – exome sequencing

The new DNA sequencing technologies have made it fast and affordable to read the DNA code of all genes, in one go. This is often called ‘exome sequencing’ because genes are made up of blocks called ‘exons’ and the umbrella term for all exons of all genes is the ‘exome’. The exome includes about 200,000 exons in about 20,000 genes.

The exome takes up less than 2% of the whole genome, but that is still ~35,000,000 letters of DNA code. Most exome sequencing involves cutting this up into millions of more manageable, overlapping fragments of 100 letters (called ‘reads’), and then reading the actual letters of code in each of them.

This means that at the end of the data generation part of an exome genetic test you are left with ~30,000,000 fragments of DNA code that need to be pieced back together.

It’s like solving a 30 million piece jigsaw puzzle.
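
To get a feel for these numbers, here is a rough back-of-envelope sketch in Python. It uses only the approximate figures quoted above and makes one big simplifying assumption that real data never meets: that every read lands neatly inside the exome. The function name is ours, purely for illustration.

```python
# Approximate figures quoted in the text above.
EXOME_LETTERS = 35_000_000   # letters of DNA code in the exome
READ_LENGTH = 100            # letters per fragment ("read")
READ_COUNT = 30_000_000      # fragments produced by one exome test


def average_depth(read_count, read_length, target_letters):
    """Average number of times each letter is read.

    Assumes (unrealistically) that every read falls inside the target.
    """
    return read_count * read_length / target_letters


depth = average_depth(READ_COUNT, READ_LENGTH, EXOME_LETTERS)
print(f"Each letter is read about {depth:.0f} times on average")
```

Under that assumption each letter of the exome would be read roughly 86 times. This redundancy is the point of having overlapping pieces: the same letter is seen in many different fragments, which helps separate genuine differences from reading errors. In practice the figure is lower and uneven, because some reads fall outside the exome and some regions are harder to capture than others.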


Building the exome jigsaw

When you do a jigsaw puzzle you have a painting on the front of the box to help you put the pieces in the right places. It is similar with the exome jigsaw – you use the human genome sequence, first built by the Human Genome Project, as the reference picture to piece back together the exome of the person being tested.

There are many, many challenges and complexities in doing an exome jigsaw puzzle. For example, the reference picture is very similar, but not identical, to each person’s exome – more a Monet than a Manet depiction! The parts that are not identical are also the very parts where disease-causing changes will be hiding out. So it’s critical that every piece is read as accurately as possible and that pieces that are different from the reference are robustly identified.

A second problem is that there are always some missing pieces. It is obviously important to have as few missing pieces as possible, and this is continuing to improve. But it is also very important to know which pieces are missing. We naturally need to review a negative gene test result more carefully if some pieces of relevant genes are missing.
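
The ideas in this section can be illustrated with a toy sketch in Python. It is not how OpEx or any real analysis tool works: real tools use far faster methods and also handle inserted and deleted letters, which this sketch ignores. The short reference, the reads and the function names are all invented for illustration. Each read is placed where it fits the reference best, and we then report the positions that differ from the reference and the positions that no read covers.

```python
from collections import Counter


def place_read(reference, read):
    """Return the position where the read fits the reference with fewest mismatches."""
    best_offset, best_mismatches = 0, len(read) + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset


def build_jigsaw(reference, reads):
    """Place every read, then list differences from the reference and missing positions."""
    pileup = [[] for _ in reference]  # letters observed at each reference position
    for read in reads:
        offset = place_read(reference, read)
        for i, letter in enumerate(read):
            pileup[offset + i].append(letter)

    differences, missing = {}, []
    for position, letters in enumerate(pileup):
        if not letters:
            missing.append(position)  # a missing piece: no read covers this letter
            continue
        most_common_letter = Counter(letters).most_common(1)[0][0]
        if most_common_letter != reference[position]:
            differences[position] = (reference[position], most_common_letter)
    return differences, missing


# Invented 24-letter "reference picture" and four 8-letter reads.
reference = "ACGTACCTGAATCGGATTCAGGTC"
reads = ["ACGTACCT", "TACTTGAA", "ACTTGAAT", "TTCAGGTC"]

differences, missing = build_jigsaw(reference, reads)
print("Differences from reference:", differences)
print("Positions with no pieces:", missing)
```

In this made-up example two of the three reads covering position 6 show a T where the reference has a C, so that position is reported as a difference, while positions 12 to 15 are reported as missing because no read covers them. A negative result for a gene lying in those positions would need the more careful review described above.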


Making a robust exome sequencing analysis pipeline for the clinic

Many different processes must be performed when putting together the pieces of an exome jigsaw and many different tools are available for each of these processes. Moreover, many different versions of each tool are also often available and these can perform slightly, or sometimes substantially, differently.

The layers of extensive variability make evaluation of exome analysis pipeline performance very challenging, and comparisons of different pipelines nearly impossible.

For exome testing in the clinic this is a particular challenge. Most of the available tools were not developed with the clinic in mind, and require modifications to meet clinical needs and standards. The myriad complexities of constructing, testing and running a clinical exome analysis pipeline can add time, cost and staffing demands to genetic testing.


OpEx is a validated, automated pipeline for exome sequence analysis

To address these issues we have developed and tested an end-to-end, press-and-play pipeline that puts together the exome jigsaw, identifies all the pieces that are different from the reference and provides comprehensive information about the presence and quality of the pieces.

The pipeline is called OpEx. This is short for ‘Optimised Exome analysis’.

We specifically focused on making OpEx simple to use: the whole pipeline is set off from a single command. But there is also in-built flexibility, so users can modify any of the components should they wish to. The pipeline and all the components are freely available here.

The performance of OpEx is excellent, particularly for the small insertions and deletions of DNA that are tricky, but crucial, to read because of their major role in causing human disease.

Our paper describing OpEx performance and the testing we have done was published this week.

We have tested the pipeline with publicly-available datasets that we previously released: the ICR1000 exome series and the ICR142 NGS validation series. This allows anyone to recreate exactly what we have done, should they wish to. Potential OpEx users can also use the same datasets in other pipelines to make direct comparisons with their existing set-up.

We have written detailed documentation, which is available as a PDF, to further facilitate such comparisons. It is essential that fully comprehensive, transparent information about analysis pipelines is provided by everyone doing this type of testing. Without it we cannot ensure that everyone is reaching the required standards, and our ability to deliver genetic medicine will be compromised in multiple ways.


Who is OpEx for?

We hope OpEx will prove useful to many different research and clinical testing laboratories. We believe it will be particularly useful to the rapidly increasing number of labs seeking a fully developed, validated pipeline that requires no specialised informatics expertise to set up, and minimal user input to run.


What is the TGMI doing next?

We are now planning evaluations of OpEx in different labs and in comparison with different tools. We would be delighted to hear from anyone interested in using or evaluating OpEx. Feel free to download and play with OpEx, or to contact us directly to discuss this.

The TGMI is also developing other tools and resources to make genetic medicine processes easier, better and more standardised.

Please contact us or comment below if you have encountered difficulties in exome analysis, or if you have suggestions for tools and processes in genetic medicine that the TGMI should consider focusing on.