Trawling scientific literature for gene-disease information

One of the things the TGMI is doing is building a Gene-Disease Map (GDM), which will provide a systematic overview of which genes are associated with disease. This is needed because it is currently very unclear how many disease genes there are, as we discussed in a previous blog post. There are many challenges in this project. Here we look at just one of them: searching the scientific literature for gene-disease associations.


Which gene identifier should we search on?

The start of any information search is deciding what to search on. This can be challenging when searching for information on genes because an individual gene can be identified in many different ways. The HUGO Gene Nomenclature Committee (HGNC) is responsible for issuing and approving symbols and names for human genes, but both have changed for many genes over time. We now have a legacy of scientific literature with different identifiers used for the same gene, as we highlighted in a previous post.

Approximately a third of all human genes have at least one previously approved symbol, some 6% have at least two, and one gene, ARX, has 11 previously approved symbols. For about 80 genes the current HGNC-approved symbol is a previously approved symbol for a different gene, adding further confusion.


Unapproved gene identifiers

The HGNC-approved symbol has been used in the majority of scientific papers, but in a significant proportion other, unapproved symbols have been used to identify genes. For example, BRCA2 is sometimes called FANCD1 because children with two pathogenic BRCA2 mutations have a very rare condition called Fanconi anemia subtype D1.

80% of genes have at least one unapproved symbol, 47% have at least two and the range extends to 19 different unapproved symbols.

The many different gene identifiers mean you have to bait your search hook with multiple different lures before dipping it into the sea of publications, to make sure you search through all the available information for each gene. For EPCAM this requires searching on 23 different identifiers. In total we searched on 60,258 different gene identifiers to construct the GDM.
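To make the "multiple lures" idea concrete, here is a minimal sketch of how the identifiers for one gene could be combined into a single PubMed search string. It is an illustration, not the method the GDM curators used: the function name build_pubmed_query is hypothetical, and restricting each term to the Title/Abstract field is an assumption made for the example. The two identifiers are the BRCA2 and FANCD1 pair mentioned above.

```python
def build_pubmed_query(identifiers, field="Title/Abstract"):
    """OR together every identifier for one gene, each limited to one search field.

    Hypothetical helper for illustration; `field` is assumed to be a
    PubMed search field tag such as Title/Abstract.
    """
    seen = set()
    terms = []
    for identifier in identifiers:
        cleaned = identifier.strip()
        key = cleaned.upper()
        # Skip blanks and case-insensitive duplicates of an earlier identifier.
        if not cleaned or key in seen:
            continue
        seen.add(key)
        # Quote each identifier so it is searched as an exact phrase.
        terms.append(f'"{cleaned}"[{field}]')
    if not terms:
        raise ValueError("at least one identifier is required")
    return "(" + " OR ".join(terms) + ")"


# Approved symbol plus one unapproved symbol for the same gene.
print(build_pubmed_query(["BRCA2", "FANCD1"]))
```

A gene like EPCAM would simply pass a longer list, and the resulting string can be pasted into the PubMed search box alongside any further filters.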


Trawling the literature for the relevant information

Within the 27 million citations in NCBI’s PubMed – the database of biomedical literature – there are nearly 2.5 million about genes. These scientific papers cover a very broad range of information about genes, only a proportion of which is relevant to whether or not the gene can cause human disease. So once you have pulled together all the different gene identifiers so you know what to search on, the next challenge is refining the search to pull out the relevant information. In effect, we have to design our hooks to make sure we catch the right fish.

The GDM curators have made extensive use of the Medical Genetics filter available in PubMed. We recommend it highly as an easy, consistent way to pull out the most relevant gene-disease information from the scientific literature. It isn’t foolproof: some crucial papers, particularly older ones, are not pulled out by the filter, so it hasn’t been our only strategy. But for most people’s purposes it is both sufficient and probably better than the ad hoc approaches that are often used.


The challenge of text-based searches

Despite careful hook design and baiting with multiple lures, it is still possible to catch huge numbers of publications that are irrelevant to constructing the GDM. The most frequent reason is that the gene symbol is a word in general use (e.g. ACHE, CAMP, IMPACT) or a commonly used abbreviation (e.g. HR, CS, PC, AR, TNF, CAD).

Informatic tools to tag gene names in biomedical text are being developed and are improving, but none is perfect yet. These use techniques such as applying constraints to the words preceding or following the gene symbol. So far they still require specialist informatic input to set up, so although they are increasingly useful for large-scale projects like the GDM, they are not widely accessible to the many non-specialists who are increasingly searching scientific literature for genetic information.
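As a rough illustration of what a constraint on neighbouring words can look like, the sketch below only accepts a gene symbol when a gene-related cue word appears within a few tokens of it. This is a toy example, not one of the real tagging tools: the cue-word list, the window size and the function name find_gene_mentions are all assumptions chosen for the illustration.

```python
import re

# Hypothetical cue words; a real tagger would use a far richer model.
CONTEXT_CUES = {
    "gene", "genes", "mutation", "mutations",
    "variant", "variants", "locus", "allele", "alleles",
}


def find_gene_mentions(text, symbol, window=3):
    """Return character offsets where `symbol` has a cue word within `window` tokens."""
    # Split the text into alphanumeric tokens, remembering where each one starts.
    tokens = [(m.group(), m.start()) for m in re.finditer(r"[A-Za-z0-9]+", text)]
    hits = []
    for i, (token, start) in enumerate(tokens):
        # Case-sensitive match, so the everyday word "impact" is ignored.
        if token != symbol:
            continue
        # Collect the tokens just before and just after the candidate symbol.
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if any(word.lower() in CONTEXT_CUES for word, _ in neighbours):
            hits.append(start)
    return hits


print(find_gene_mentions("Mutations in the IMPACT gene were identified.", "IMPACT"))
print(find_gene_mentions("THE IMPACT OF DIET ON HEALTH", "IMPACT"))
```

The first sentence yields a hit because "gene" and "Mutations" sit next to the symbol; the all-capitals title yields none. Even this simple rule will miss genuine mentions that lack a nearby cue word, which is part of why such tools need specialist tuning.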

A further problem is the lack of clarity and consistency in how genetic findings are reported. For example, it can often be difficult to work out whether the reported genetic variation was present only in the tumor cells (somatic) or in all the cells (germline). This is extremely important for constructing the GDM because only the latter can be inherited.

Another issue is that many studies now report on thousands of genes, and the relevant information may be deep in a supplementary table, in a format that is not readily accessible by standard searches.


Making things easier

TGMI is collaborating with HGNC on the stabilisation of gene symbols, and together we aim to decide and fix a final, stable gene identifier for each gene. We also want to promote the concurrent use of numeric gene IDs, as these are much more stable and are more useful and robust in informatic-based searches.

We will also be feeding the challenges and lessons we have learnt in constructing the GDM through to the initiatives working hard to improve the speed, accuracy, quality, completeness and relevance of searches of the scientific literature for gene-disease information.


Image by crew and officers of NOAA Ship Miller Freeman, via Flickr. CC-BY