The 9 Vs of genetic medicine big data

Big data has proved to be both a major catalyst and a major bottleneck in genetic medicine. We cannot let data alone decide personal healthcare choices, however big it is. We need to turn data into knowledge and knowledge into wise actions. This is very challenging. Frameworks for translating big data into useful outputs are now emerging. How do these frameworks apply to genetic medicine big data?


The 9 Vs of big data

In recent years, people have started to break down and categorise the matters needing attention when using big data. Such lists define categories using words beginning with the letter ‘V’. Initially there were 3 Vs – volume, velocity and variety – but the list has been growing; some now run to 42 Vs! Below, I describe the 9 Vs I believe are important for the generation and use of genetic medicine big data. To Volume, Velocity and Variety, I add Volatility, Variability, Validity, Vulnerability, Visibility and Value.



Volume

We are generating massive amounts of data in genetic medicine, and the rate of generation continues to rise. In 2014, 228,000 genomes were sequenced across the globe. By 2017, individual companies were starting to sequence this number of genomes every year. Some predict that 1 billion genomes will have been sequenced by 2025. The volume of clinical data that genetic medicine relies on is also increasing dramatically, though we still need further expansion.

The data volume in genetic medicine, coupled with the ongoing year-on-year expansion of that volume, leads to many issues that need to be handled. Simply storing the data in a safe, yet accessible, way is hard. We also urgently need better ways to integrate and analyse big data.



Velocity

We can generate genetic data at high velocity but we still cannot turn it into high-velocity results.

Velocity refers to the speed at which data are being generated. In many fields, keeping up with the velocity of data has proved hard. But data velocity in genetic medicine remains a bottleneck – we are not generating genetic-based results fast enough, often enough. We can generate genetic data at high velocity, but we still cannot turn it into high-velocity results or decision-making.


Variety

The variety of data needed for genetic medicine is a major challenge. Genetic data is simple, structured and standardised; it is ideally suited to big data approaches. But the clinical data we use is complex, unstructured and unstandardised. And it takes many different forms – patient examinations, blood results, X-rays, biopsies etc. Handling and integrating so many different types of clinical data, at volume and velocity, is very difficult.



Variability

Variability in big data causes inconsistencies and errors. It is a big problem in genetic medicine, as we have highlighted in several posts. Large datasets can make variability harder to detect, and they require different strategies from those that work for small datasets.

A key reason the TGMI formed was to tackle the avoidable harms that variability in genetic medicine causes. The Gene-Disease Map is tackling variability in the links between genes and disease. The Clinical Annotation Reference System is removing variability in gene symbols and templates for variant annotation. The benchmarking resources we have made, such as the ICR96 exon CNV validation series, are reducing variability in genetic testing.
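To illustrate the kind of variability the gene symbol work addresses, here is a minimal sketch of normalising legacy gene aliases to current symbols. The alias table and function are hypothetical illustrations, not the Clinical Annotation Reference System itself.

```python
# Minimal sketch: mapping legacy gene aliases to current HGNC symbols.
# The alias table below is a tiny illustrative sample, not a real reference.
ALIASES = {
    "HER2": "ERBB2",   # legacy alias -> current symbol
    "P53": "TP53",
    "BRCA3": None,     # withdrawn symbol: no current equivalent
}

def normalise_symbol(symbol):
    """Return the current symbol for an alias, None if withdrawn,
    or the input unchanged if it is not in the alias table."""
    key = symbol.strip().upper()
    return ALIASES.get(key, key)

print(normalise_symbol("her2"))   # ERBB2
print(normalise_symbol("BRCA1"))  # BRCA1 (already current, passes through)
```

A real system would source its alias table from an authoritative nomenclature database rather than a hard-coded dictionary, but the principle – every incoming symbol funnelled through one canonicalising step – is the same.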



Volatility

Knowledge continues to increase rapidly in genetic medicine. For example, every month we hear about the discovery of new disease-causing genes. This volatility of information, and the consequent need to keep information up to date, adds pressure and complexity to processes.

One area where this problem is often highlighted is variant interpretation. Variant information changes daily, as more and more variants are reported. The new information can often help us refine how likely it is that a variant causes disease. But we have not worked out how to accommodate this rapidly changing knowledge base into healthcare systems. We haven’t decided how often variant interpretations should be updated. We haven’t decided if it is the doctor’s responsibility to request an update, or if updates should be a routine responsibility of a genetic test provider. And we haven’t worked out a system that can reliably and appropriately communicate updates back to patients.
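One way to make this volatility explicit is to version each interpretation and flag it for re-review after a fixed interval. The sketch below assumes a hypothetical one-year re-review policy; the variant, classification and dates are illustrative, and no such policy is established in the source.

```python
# Sketch: a versioned variant interpretation record, assuming a
# hypothetical policy of flagging classifications for re-review
# once they are more than a year old.
from dataclasses import dataclass
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=365)  # illustrative policy, not a standard

@dataclass
class Interpretation:
    variant: str
    classification: str   # e.g. "pathogenic", "uncertain significance"
    reviewed_on: date

    def needs_review(self, today):
        """True if the classification is older than the review interval."""
        return today - self.reviewed_on > REVIEW_INTERVAL

record = Interpretation("BRCA1 c.68_69del", "pathogenic", date(2017, 1, 10))
print(record.needs_review(date(2018, 6, 1)))  # True: over a year old
```

Who triggers the re-review – doctor, lab or an automated pipeline – is exactly the unresolved question the paragraph above describes; the data structure only makes the staleness visible.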

Phenotype data is inherently volatile – we all change in many ways as we go through life. Updating, integrating and storing longitudinal phenotype information is a major bottleneck in genetic medicine today, largely due to its volatility.



Validity

It is essential that the data we use in genetic medicine are as accurate as possible. Standards and quality management systems that ensure validity are the foundation of safe medical practice, and they will be the backbone of genetic medicine big data use. Today, we lack appropriate standards and quality management processes in many areas of modern genetic medicine. Reconfiguring such processes so they can accommodate big data has been a particular issue: simply upscaling existing systems is often neither possible nor appropriate. We need to build new, big-data-compatible validity systems. The Quality Sequencing Minimum (QSM) we announced last week is an example of how TGMI is helping to drive forward delivery of big data validity in genetic testing.
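To make the idea of an automated, big-data-compatible validity check concrete, here is a minimal sketch in the spirit of a minimum sequencing quality standard. The 20x depth threshold and the data layout are illustrative assumptions, not the published QSM definition.

```python
# Sketch: flagging bases whose sequencing depth falls below a minimum.
# The threshold and toy data are illustrative, not the actual QSM.
MIN_DEPTH = 20  # hypothetical minimum read depth per base

def bases_failing_minimum(depths):
    """Return 0-based positions whose read depth is below MIN_DEPTH."""
    return [i for i, d in enumerate(depths) if d < MIN_DEPTH]

exon_depths = [35, 28, 19, 42, 7, 33]      # toy per-base coverage, one exon
print(bases_failing_minimum(exon_depths))  # [2, 4]
```

The point of a check like this is that it scales: the same rule can run over every base of every genome in a pipeline, which is what makes a validity standard usable at big data volumes.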



Vulnerability

All systems have vulnerabilities, and there are specific vulnerabilities in using big data that require focused attention. The risk of security breaches is a generic problem for all networked data, and privacy breaches are of high concern to society.

The vulnerabilities caused by variability, volatility and inadequate validity of big data are less often discussed, but of equal, if not greater, importance. Transparency about vulnerabilities, together with a commitment to finding and addressing them, is essential.



Visibility

There are two ways in which visibility is important. The first is the need for visibility and transparency in how people use big data to draw conclusions. We have too many black boxes in genetic medicine. One of the most valuable aspects of the QSM is that it provides visibility of the quality of genetic testing.

Second, we need better ways to visualise big data, because it has to be presented in ways that humans can make sense of. The original form of the data never delivers this! Visualisation tools are improving, but are still far behind what we need. We spent considerable effort on the visualisation interface of CoverView to make it a more useful and user-friendly big data visualisation tool.



Value

Finally, we must consider value, in all its guises, as we integrate big data into genetic medicine. There are both opportunities and challenges here. The opportunity is that genetic medicine big data has the potential to revolutionise many aspects of healthcare, including diagnostics, drug development and disease prevention. It also has the potential to generate monetary value, which will drive innovation and sustainability. In 2015 the DNA sequencing market was valued at $4.6Bn, and it is predicted to be worth $19Bn by 2020.


But we should also spend time considering the societal impacts of genetic medicine big data and the changes it heralds. The potential for people to interact with their genetic information directly, flexibly, repeatedly and from the comfort of their own smartphones is perhaps the biggest change. This will add extra challenges as genetic medicine becomes powered by, and dependent on, big data.