Making data accessible


Data access

Data sharing is a hot topic that has permeated most areas of human endeavour in recent years. In genetic medicine the ethical principles of data sharing have become a central, controversial subject that is being broadly and animatedly debated. Whilst important, these discussions rarely address the practicalities of sharing data. How do we, how could we, how should we make data accessible?

 

Genetic medicine must share data

Many disciplines are, rightly, currently focusing their discussions on whether to share data at all. Genetic medicine is in a different position. We have to share data because we cannot treat our patients appropriately if we cannot compare their data with data from other patients. This data comparison step is an essential, integral part of the genetic testing process. It gives us opportunities to do more good, to make more diagnoses and to give people better information. It also gives us opportunities to do less harm, to make fewer misdiagnoses and fewer missed diagnoses. Data sharing in genetic medicine is essential for safe practice.

So we need to focus attention on how to make data accessible.

 

Data deposition is not the (only) answer

The default strategy for making data accessible is to deposit it in data repositories. Increasingly, data deposition is becoming mandated. For example, many journals stipulate that data must be deposited ‘somewhere’ for research to be eligible for publication. Genetic testing accreditation bodies are considering whether laboratories should be required to deposit data to be accredited.

The underlying principle for these recommendations and stipulations – that we must make data accessible – is correct. But there are issues with the data deposition model that we need to address. And alternative models to make data accessible also merit consideration.

 

Data repositories are not stable

I have always been an advocate of making data accessible and I was an early practitioner. Over the last fifteen years my group has deposited data in many different repositories. Several are no longer active. Some are active but are not being kept up to date. Others are active but are being used for other purposes and are not the appropriate place to house either our historical or current data.

This is not a criticism of data repositories. It is an inevitable consequence of how they are set up and funded. Most data repositories relevant to genetic medicine were started by noble-minded, passionate individuals trying to serve our essential need for data access. Most were set up with no, or limited, funding and no sustainability plan. All are underpinned by under-resourced people struggling to fulfil other commitments.

 

Data deposition is not easy

Every data repository says that depositing data with them is easy. For data generators this is a dark untruth compromising the core of data sharing. It is not easy to deposit data, at least not the first time. For each and every repository we have to organise our data to fit their requirements, which takes time and resources. There is scant communication between repositories and each has made things ‘easy’ for depositors in different ways that are rarely interchangeable.

As a comparison I have come to appreciate the clarity and relative simplicity of the UK online tax return self-assessment system. It has taken some years to reach this nirvana, through many false steps. Should I emigrate I suspect this hard-won expertise will be of little use in navigating the tax return systems of another country. Different data repositories are similarly individual in how they try to achieve similar goals.

 

Repeat data depositions are inefficient and costly

Another issue is that repositories rarely share data with each other. If a repository closes, it does not transfer the data to another repository; you have to do that yourself. Federation of data between repositories is being discussed but has not become widespread. And the robustness and governance of how data will be federated remain unclear.

So data generators find themselves having to deposit and maintain the same data in multiple data repositories, concurrently and sequentially. This is inefficient, error-prone, time intensive and costly.

 

Data provenance is important

When we are trying to work out if a genetic variant is relevant to a person’s medical condition we search databases and the medical literature to see if anyone else with the variant has been reported. Let’s say we find four people with the same very rare medical condition and the same genetic variant that our patient has. In the population databases, which now include hundreds of thousands of individuals, that same variant is present in only one person not known to have the disease. We would consider this to be suggestive evidence that the genetic variant has caused the rare medical condition.

But what if it turns out that it is not four different people, but one person who has been reported four times? Then the variant has been reported in one person known to have the disease and one person not known to have the disease. This dramatically changes the likelihood that the genetic variant has caused the medical condition.

This type of error is, unfortunately, very common. We must maintain the provenance link between the data and the data generator to prevent it.
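The effect of a lost provenance link can be illustrated with a minimal sketch. It assumes, hypothetically, that every report carries an identifier for the data generator and a pseudonymous individual identifier assigned by that generator; the field names, the identifiers and the placeholder variant below are illustrative only and do not correspond to any real repository format. Counting reports gives four affected carriers; counting distinct (generator, individual) pairs gives one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    source: str         # where the report was found (repository or paper)
    generator: str      # hypothetical ID of the laboratory that generated the data
    individual_id: str  # hypothetical pseudonymous ID assigned by the generator
    variant: str
    affected: bool      # True if the person is known to have the condition


def count_individuals(reports, variant):
    """Count distinct affected and unaffected carriers of a variant.

    Individuals are identified by (generator, individual_id), so the same
    person reported through several sources is counted only once.
    """
    affected, unaffected = set(), set()
    for r in reports:
        if r.variant != variant:
            continue
        key = (r.generator, r.individual_id)
        (affected if r.affected else unaffected).add(key)
    return len(affected), len(unaffected)


VARIANT = "GENE1:c.100A>G"  # placeholder, not a real variant

reports = [
    Report("repository-1", "lab-A", "ind-001", VARIANT, True),
    Report("repository-2", "lab-A", "ind-001", VARIANT, True),
    Report("paper-1", "lab-A", "ind-001", VARIANT, True),
    Report("paper-2", "lab-A", "ind-001", VARIANT, True),
    Report("population-db", "cohort-B", "ind-900", VARIANT, False),
]

# Naive count: every report of an affected carrier is treated as a new person.
naive_affected = sum(r.affected for r in reports if r.variant == VARIANT)

# Provenance-aware count: one affected person, one unaffected person.
distinct_affected, distinct_unaffected = count_individuals(reports, VARIANT)
```

The deduplication only works if the generator and individual identifiers survive every copy of the data, which is exactly the provenance link that is lost when reports are re-entered by hand.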

 

Data governance is important

We urgently need sustainable genetic data repositories that adhere to defined standards and are accredited and regulated.

Once you have deposited data, who is responsible for its accuracy and integrity? Is it the repository? Or is it the depositor? Or is it both? As a depositor, what responsibilities do you have to ensure that the data repository is adhering to standards, and what standards should it adhere to?

There are many repositories that catalogue variants in individual genes. Some also include information about the medical history of the person with the variant. These databases have become the bedrock of genetic medicine. But none are official, regulated, accredited data repositories for use in genetic medicine. As far as I am aware they all have disclaimers similar to ClinVar, perhaps the most widely used database in genetic medicine, that their information is ‘not intended for direct diagnostic use or medical decision-making’.

It is entirely correct for the data repositories to do this. But given that these data are essential for the practice of genetic medicine we urgently need to decide how we are going to build and fund sustainable data repositories that adhere to defined standards and are accredited and regulated.

 

Making data accessible

Every week in our clinical testing laboratory, TGLclinical, we analyse 100 cancer predisposition genes in 96 people. As a data generator with a deep desire to make data accessible I would like to be able to release, each week, the data that I want and need to make accessible. A simple, minimal dataset of the variants detected and whether or not the person had cancer would be extremely useful for the world and would be press-of-a-button easy for us to release.

What do I need to do this? Firstly, I need to have stable, consistent gene symbols, gene templates and genetic variant notation systems so that the data we release can be easily and faithfully integrated elsewhere. The Clinical Annotation Reference System (CARS) that the TGMI is building will help achieve this. Secondly, I need this to be an acceptable way for data generators to share their data with data repositories.

Data repositories would then have the responsibility to integrate the data we have made accessible into their repositories. The repositories will have to make their formats accommodate how we have made our data accessible, rather than the other way around. The repositories will need to maintain the provenance information and have systems to accommodate updates.

A single-point data accessibility system would allow data generators to do one-time data releases and one-time data updates, instead of multiple data depositions. It would be more efficient, more robust and a better use of clinical laboratory time.
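What a one-time release might look like can be sketched in a few lines. This is an illustration under stated assumptions, not a proposed standard: the record fields, the `supersedes` field used to express updates, the release identifiers and the placeholder gene and variant are all hypothetical. The point is only that the minimal dataset, its provenance and its update history can travel together in one self-describing file that any repository could ingest.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Observation:
    individual_id: str  # hypothetical pseudonymous ID, stable across releases
    gene: str           # gene symbol
    variant: str        # variant notation
    had_cancer: bool    # whether the person had cancer


def build_release(generator, release_id, observations, supersedes=None):
    """Bundle observations with the provenance a repository must preserve.

    `supersedes` names the earlier release that this one updates, if any,
    so a repository can apply an update instead of adding duplicates.
    """
    return {
        "generator": generator,
        "release_id": release_id,
        "supersedes": supersedes,
        "observations": [asdict(o) for o in observations],
    }


def to_json(release):
    """Serialise a release deterministically (sorted keys)."""
    return json.dumps(release, indent=2, sort_keys=True)


observations = [
    Observation("ind-0001", "GENE1", "c.100A>G", True),   # placeholder values
    Observation("ind-0002", "GENE1", "c.100A>G", False),  # placeholder values
]
release = build_release(
    "TGLclinical", "release-002", observations, supersedes="release-001"
)
print(to_json(release))
```

Because each release names its generator and the release it replaces, a repository integrating the file has what it needs to keep the provenance information and to accommodate updates.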

 

Making it easy to make data accessible is essential

Most importantly, we need to focus attention on the data generators to help make it as easy as possible for them to make their data accessible. This involves addressing the issues of data privacy and ownership that are the primary focus of most data sharing discussions. But we also urgently need to design and implement practical, efficient, robust, sustainable systems for making data accessible.

The promise and safety of genetic medicine depend on it.