Bringing stability to gene symbols is a central aim of the TGMI. It is part of our overarching aspiration to develop a Clinical Annotation Reference System (CARS). The CARS will provide a standard framework for consistent reporting of gene variation. As we have discussed in several posts, this is urgently needed; variability and inconsistency in genetic medicine are compromising research and clinical care. One of the most fundamental and important areas where we need to tackle this problem is in the description of gene variation.
Practicality is essential
There is much required and important information underpinning a description of a gene variant. All relevant information needs to be transparent, accessible and durably available. However, for practical use one also needs a ‘shorthand’ way of describing the variant. This shorthand term needs to be understandable and recognisable, it needs to encompass the essential information needed for medical use, and it needs to be linkable to the more detailed underlying information.
An example of the current format most often used in the clinic is:
BRIP1 c.2392C>T p.Arg798X
This notation shows that in the gene BRIP1, at position 2392 in the DNA sequence of the gene, a ‘C’ has been swapped for a ‘T’ (c.2392C>T). This results in the codon for the amino acid arginine at position 798 of the protein coded by BRIP1 being swapped for a ‘stop’ codon, denoted by ‘X’ (p.Arg798X).
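To make the structure of this shorthand concrete, here is a minimal Python sketch that splits it into its parts. It is an illustration only, not a TGMI or HGNC tool: the names `VariantShorthand` and `parse_shorthand` are hypothetical, and the pattern assumes a simple single-base substitution written exactly in the three-part form shown above.

```python
import re
from typing import NamedTuple


class VariantShorthand(NamedTuple):
    """The parts of a shorthand such as 'BRIP1 c.2392C>T p.Arg798X'."""
    gene_symbol: str
    cdna_position: int
    ref_base: str
    alt_base: str
    ref_amino_acid: str
    protein_position: int
    alt_amino_acid: str


# Assumption: gene symbol, then a single-base DNA change, then the protein change.
# Other variant types (deletions, insertions, ...) are deliberately not handled.
_SHORTHAND = re.compile(
    r"^(?P<gene>[A-Za-z0-9-]+)\s+"
    r"c\.(?P<cpos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])\s+"
    r"p\.(?P<aa_ref>[A-Z][a-z]{2})(?P<ppos>\d+)(?P<aa_alt>[A-Z][a-z]{2}|X|\*)$"
)


def parse_shorthand(text: str) -> VariantShorthand:
    """Split a variant shorthand into gene, DNA change and protein change."""
    match = _SHORTHAND.match(text.strip())
    if match is None:
        raise ValueError(f"Not a recognised variant shorthand: {text!r}")
    return VariantShorthand(
        gene_symbol=match["gene"],
        cdna_position=int(match["cpos"]),
        ref_base=match["ref"],
        alt_base=match["alt"],
        ref_amino_acid=match["aa_ref"],
        protein_position=int(match["ppos"]),
        alt_amino_acid=match["aa_alt"],
    )
```

Calling `parse_shorthand("BRIP1 c.2392C>T p.Arg798X")` gives a record whose `gene_symbol` is `BRIP1`, whose DNA change is C to T at position 2392, and whose protein change is Arg to X at position 798. Note that the sketch takes the gene symbol at face value, and that is exactly the part of the shorthand this post is concerned with.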
I used this example in a previous post to highlight some of the issues that need to be addressed in describing changes in DNA sequence. Here I want to focus on issues relating to the gene identifier.
Too many gene identifiers
A single gene can have several different letter-based and/or number-based identifiers. Sometimes different purposes justify the use of different identifiers. Mostly, using different identifiers simply brings confusion and inconsistency. In the bygone era of a given gene only being of interest to a small number of clinical and scientific experts, this variability was tolerable. Experts would be aware that, for example, CASC5 and KNL1 are frequently used as symbols for the same gene.
But in today’s world of large-scale sequencing of all genes, by increasingly diverse medical and academic groups, this variability is no longer tolerable. Requiring specialised knowledge of the idiosyncrasies of gene identifiers is neither safe nor sustainable in genetic medicine. It inevitably leads to errors and inconsistencies, and impedes the robust integration of global gene variation data, which is so vital to modern genetics.
BRIP1 – an example of the problem
BRIP1 provides a perfect example of the problem of too many gene identifiers. It is by no means exceptional. Indeed, when I selected BRIP1 to highlight DNA sequence notation considerations I had no idea it would serve as such a good example of gene identifier problems!
BRIP1 is the official gene symbol for a gene on chromosome 17 whose full name is ‘BRCA1 interacting protein C-terminal helicase 1’. Individuals with a mutation in BRIP1 have an increased risk of ovarian cancer, and the gene is regularly tested in clinical practice. BRIP1 has also been called BACH1, which stands for ‘BRCA1/BRCA2-associated helicase 1’. But BACH1 is the official symbol of an entirely different gene, on chromosome 21, where it stands for ‘BTB and CNC homology 1, basic leucine zipper transcription factor 1’.
Today there are 368 entries in PubMed using the search term ‘BACH1’. I hoped to include in this post how many of these entries were related to the gene on chromosome 17 and how many were related to the gene on chromosome 21. But it was just too dispiritingly difficult to work out, partly because both genes have been implicated in cancer, adding yet more layers of confusion.
What my cursory review of the scientific literature did reveal was that FANCJ is also regularly in use as an alternative name for BRIP1. People with two BRIP1 mutations have a condition called Fanconi Anemia subtype J, which is the origin of this identifier. But it is not, and has never been, the official symbol designated by the HUGO Gene Nomenclature Committee (HGNC).
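The BRIP1 situation can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how HGNC or TGMI handle symbols: the `Gene` class, the `GENES` table and the `candidates` and `resolve` functions are all hypothetical, and the table holds only the two genes and the symbols mentioned in this post.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Gene:
    approved_symbol: str
    chromosome: str
    other_symbols: Tuple[str, ...] = ()  # non-approved symbols seen in the literature


# Toy table built only from the facts in this post (not a real HGNC download).
GENES = (
    Gene("BRIP1", "17", ("BACH1", "FANCJ")),
    Gene("BACH1", "21"),
)


class AmbiguousSymbolError(LookupError):
    """Raised when one symbol could refer to more than one gene."""


def candidates(symbol: str, genes: Tuple[Gene, ...] = GENES) -> List[Gene]:
    """Return every gene the symbol could refer to, approved or not."""
    wanted = symbol.upper()
    return [
        g for g in genes
        if wanted == g.approved_symbol or wanted in g.other_symbols
    ]


def resolve(symbol: str) -> Gene:
    """Return the single gene a symbol refers to, or fail loudly."""
    matches = candidates(symbol)
    if not matches:
        raise KeyError(f"Unknown gene symbol: {symbol}")
    if len(matches) > 1:
        options = ", ".join(
            f"{g.approved_symbol} (chr {g.chromosome})" for g in matches
        )
        raise AmbiguousSymbolError(f"{symbol} could mean: {options}")
    return matches[0]
```

With this table, `resolve("FANCJ")` returns the BRIP1 record, while `resolve("BACH1")` raises `AmbiguousSymbolError` because the symbol matches both the chromosome 17 gene and the chromosome 21 gene. Failing loudly, rather than quietly picking one, is a deliberate choice in the sketch: it mirrors the difficulty of working out which gene a ‘BACH1’ paper is actually about.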
Gene symbols – the universal anchor
HGNC give each gene a unique numeric ID. These are much more stable than gene names and gene symbols. For example, there has only ever been one numeric HGNC ID for BRIP1, HGNC:20473. At TGMI we are strongly promoting inclusion of this unique HGNC numeric gene ID in all gene analyses. But these numeric IDs can’t be used in clinical practice or on gene reports. We can’t talk about people having a mutation in the 20473 gene! It’s simply not practical. Neither is it practical to use the whole gene name: ‘You have a BRCA1 interacting protein C-terminal helicase 1 mutation’. The gene symbols are, and will remain, the universal anchor that links to everything else; we tell people they have a BRIP1 mutation.
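One way to picture this division of labour is a record that carries both identifiers, with the numeric ID used for analysis data and the symbol used for what people read. The sketch below is purely illustrative: `GeneRecord`, `report_line` and `analysis_row` are hypothetical names, not part of any HGNC or TGMI software, and the only data used is what appears in this post.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GeneRecord:
    hgnc_id: str          # stable numeric ID, used to link data together
    approved_symbol: str  # what appears on reports and is said in clinic
    full_name: str


BRIP1 = GeneRecord(
    hgnc_id="HGNC:20473",
    approved_symbol="BRIP1",
    full_name="BRCA1 interacting protein C-terminal helicase 1",
)


def report_line(gene: GeneRecord, dna_change: str, protein_change: str) -> str:
    """Human-readable shorthand: the approved symbol is the anchor."""
    return f"{gene.approved_symbol} {dna_change} {protein_change}"


def analysis_row(gene: GeneRecord, dna_change: str, protein_change: str) -> Dict[str, str]:
    """Machine-readable row: always carries the HGNC ID next to the symbol."""
    return {
        "hgnc_id": gene.hgnc_id,
        "gene_symbol": gene.approved_symbol,
        "dna_change": dna_change,
        "protein_change": protein_change,
    }
```

Here `report_line(BRIP1, "c.2392C>T", "p.Arg798X")` produces the familiar `BRIP1 c.2392C>T p.Arg798X`, while `analysis_row` keeps `HGNC:20473` alongside it, so the data stay linked to the right gene however the symbol is written elsewhere.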
We need universal usage of a fixed gene symbol
I hope you are persuaded that we need to pay urgent attention to gene symbols. We need one, approved, permanent, universally-adopted gene symbol for every gene. If there are circumstances in which the use of other gene names or symbols is helpful, this should be justified, and the approved symbol must also be included. But in medical circumstances we need a system that ensures everyone can easily and consistently use only the approved gene symbol.
What is the TGMI doing?
TGMI has formed a close partnership with HGNC to tackle this problem. Elspeth Bruford, who runs HGNC, is now part of the core TGMI team and together HGNC and TGMI are developing a strategy to bring stability to gene symbols.
This is more complex than it might at first appear. It includes activities such as responsibly and transparently fast-tracking any necessary changes in gene symbols, for example assigning permanent symbols to genes that currently have temporary symbols. And no doubt we will need to persuade some people to give up their attachment to non-approved symbols, and justify why, for the greater good!
Then we need to fix the approved, permanent gene symbols so that any future changes will only occur in exceptional circumstances.
We are first focusing on the thousands of genes that are already being tested in clinical practice. By the end of the programme we hope to have completed the task (or at least to have a clear completion road-map) for all ~20,000 genes.