Stabilising gene symbols

Bringing stability to gene symbols is a central aim of the TGMI. It is part of our overarching aspiration to develop a Clinical Annotation Reference System (CARS). The CARS will provide a standard framework for consistent reporting of gene variation. As we have discussed in several posts, this is urgently needed; variability and inconsistency in genetic medicine are compromising research and clinical care. One of the most fundamental and important areas where we need to tackle this problem is in the description of gene variation.


Practicality is essential

There is much required and important information underpinning a description of a gene variant. All relevant information needs to be transparent, accessible and durably available. However, for practical use one also needs a ‘shorthand’ way of describing the variant. This shorthand term needs to be understandable and recognisable, it needs to encompass the essential information needed for medical use, and it needs to be linkable to the more detailed underlying information.

An example of the current format most often used in the clinic is:

BRIP1 c.2392C>T p.Arg798X

This notation shows that in the gene BRIP1, at position 2392 in the DNA sequence of the gene, a ‘C’ has been swapped for a ‘T’ (c.2392C>T). This results in the amino acid arginine at position 798 of the protein coded by BRIP1 being swapped for a ‘stop’ codon, denoted by ‘X’ (p.Arg798X).
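The parts of this shorthand can be pulled apart mechanically. The sketch below is illustrative only: it assumes the three-part layout shown above (gene symbol, then the c. change, then the p. change) and handles only single-base substitutions. The names ShorthandVariant and parse_shorthand are hypothetical and not part of any standard tool.

```python
import re
from typing import NamedTuple


class ShorthandVariant(NamedTuple):
    gene_symbol: str       # e.g. "BRIP1"
    cdna_position: int     # position in the DNA sequence of the gene
    ref_base: str          # base before the change
    alt_base: str          # base after the change
    ref_amino_acid: str    # three-letter amino acid code, e.g. "Arg"
    protein_position: int  # position in the protein
    protein_change: str    # new amino acid, or a stop symbol


# Assumed layout: "<SYMBOL> c.<pos><ref>><alt> p.<Aaa><pos><change>"
_PATTERN = re.compile(
    r"^(?P<gene>[A-Za-z0-9-]+)\s+"
    r"c\.(?P<cpos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])\s+"
    r"p\.(?P<aa>[A-Z][a-z]{2})(?P<ppos>\d+)(?P<change>[A-Z][a-z]{2}|X|\*)$"
)


def parse_shorthand(text: str) -> ShorthandVariant:
    """Split a simple substitution shorthand into its named parts."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Not a recognised substitution shorthand: {text!r}")
    return ShorthandVariant(
        gene_symbol=match["gene"],
        cdna_position=int(match["cpos"]),
        ref_base=match["ref"],
        alt_base=match["alt"],
        ref_amino_acid=match["aa"],
        protein_position=int(match["ppos"]),
        protein_change=match["change"],
    )


variant = parse_shorthand("BRIP1 c.2392C>T p.Arg798X")
print(variant.gene_symbol, variant.cdna_position, variant.protein_position)
```

Note that the gene symbol is the only part of the parsed result that tells a reader which gene the positions refer to, which is why the choice of symbol matters so much.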

I used this example in a previous post to highlight some of the issues that need to be addressed in describing changes in DNA sequence. Here I want to focus on issues relating to the gene identifier.


Too many gene identifiers  

A single gene can have several different letter-based and/or number-based identifiers. Sometimes different purposes justify the use of different identifiers. Mostly, using different identifiers simply brings confusion and inconsistency. In the bygone era of a given gene only being of interest to a small number of clinical and scientific experts, this variability was tolerable. Experts would be aware that, for example, CASC5 and KNL1 are frequently used as symbols for the same gene.

But in today’s world of large-scale sequencing of all genes, by increasingly diverse medical and academic groups, this variability is no longer tolerable. Requiring specialised knowledge of the idiosyncrasies of gene identifiers is neither safe nor sustainable in genetic medicine. It inevitably leads to errors and inconsistencies, and impedes the robust integration of global gene variation data, which is so vital to modern genetics.


BRIP1 – an example of the problem

BRIP1 provides a perfect example of the problem of too many gene identifiers. It is by no means exceptional. Indeed, when I selected BRIP1 to highlight DNA sequence notation considerations I had no idea it would serve as such a good example of gene identifier problems!

BRIP1 is the official gene symbol for a gene on chromosome 17 whose full name is ‘BRCA1 interacting protein C-terminal helicase 1’. Individuals with a mutation in BRIP1 have an increased risk of ovarian cancer, and it is regularly tested in clinical practice. BRIP1 has also been called BACH1, which stands for ‘BRCA1/BRCA2-associated helicase 1’. But BACH1 is the official symbol of an entirely different gene, on chromosome 21, where it stands for ‘BTB and CNC homology 1, basic leucine zipper transcription factor 1’.

Today there are 368 entries in PubMed using the search term ‘BACH1’. I hoped to include in this post how many of these entries were related to the gene on chromosome 17 and how many were related to the gene on chromosome 21. But it was just too dispiritingly difficult to work out, partly because both genes have been implicated in cancer, adding yet more layers of confusion.

What my cursory review of the scientific literature did reveal was that FANCJ is also regularly in use as an alternative name for BRIP1. People with two BRIP1 mutations have a condition called Fanconi Anemia-subtype J, which is the origin of this identifier. But it is not, and has never been, the official symbol designated by the HUGO Gene Nomenclature Committee (HGNC).
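The BACH1 ambiguity can be made concrete with a small lookup. This is a sketch that uses only the symbols discussed above: the two tables are hand-written for illustration and are not drawn from the HGNC database, and the function name candidate_genes is hypothetical.

```python
# Approved symbols mentioned in this post, with their chromosome.
APPROVED = {
    "BRIP1": "chromosome 17",
    "BACH1": "chromosome 21",
}

# Non-approved symbols that have been used for an approved symbol.
ALIASES = {
    "BACH1": ["BRIP1"],  # BRIP1 has also been called BACH1
    "FANCJ": ["BRIP1"],  # from Fanconi Anemia-subtype J
}


def candidate_genes(symbol: str) -> list[str]:
    """Return every approved symbol that a written symbol could refer to."""
    symbol = symbol.strip()
    candidates = set()
    if symbol in APPROVED:
        candidates.add(symbol)
    candidates.update(ALIASES.get(symbol, []))
    return sorted(candidates)


for written in ("BRIP1", "FANCJ", "BACH1"):
    genes = candidate_genes(written)
    status = "ambiguous" if len(genes) > 1 else "unambiguous"
    print(f"{written}: {genes} ({status})")
```

A paper or report that writes only ‘BACH1’ resolves to two candidate genes on two different chromosomes, and nothing in the symbol itself says which one was meant.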


Gene symbols – the universal anchor

HGNC give each gene a unique numeric ID. These are much more stable than gene names and gene symbols. For example, there has only ever been one numeric HGNC ID for BRIP1, HGNC:20473. At TGMI we are strongly promoting inclusion of this unique HGNC numeric gene ID in all gene analyses. But these can’t be used in clinical practice or on gene reports. We can’t talk about people having a mutation in the 20473 gene! It’s simply not practical. Neither is it practical to use the whole gene name: ‘You have a BRCA1 interacting protein C-terminal helicase 1 mutation’. The gene symbols are, and will remain, the universal anchor that links to everything else; we tell people they have a BRIP1 mutation.
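One way to picture the division of labour is to key stored records on the numeric ID and use the symbol only for what people read. The sketch below is a minimal illustration under that assumption; GeneRecord, GENES_BY_ID and report_label are hypothetical names, and the single record holds only the BRIP1 details given in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    hgnc_id: str  # stable numeric identifier, used for linking data
    symbol: str   # short symbol, used when talking to people
    name: str     # full gene name


# Records are keyed by HGNC ID, so stored data stays attached to the
# gene even if the text shown to readers changes.
GENES_BY_ID = {
    "HGNC:20473": GeneRecord(
        hgnc_id="HGNC:20473",
        symbol="BRIP1",
        name="BRCA1 interacting protein C-terminal helicase 1",
    ),
}


def report_label(hgnc_id: str) -> str:
    """Build the text shown on a report: symbol first, ID alongside."""
    record = GENES_BY_ID[hgnc_id]
    return f"{record.symbol} ({record.hgnc_id})"


print(report_label("HGNC:20473"))
```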


We need universal usage of a fixed gene symbol

I hope you are persuaded that we need to pay urgent attention to gene symbols. We need one, approved, permanent, universally-adopted gene symbol for every gene. If there are circumstances in which the use of other gene names or symbols is helpful, this should be justified, and the approved symbol must also be included. But in medical circumstances we need a system that ensures everyone can easily and consistently use only the approved gene symbol.


What is the TGMI doing?

TGMI has formed a close partnership with HGNC to tackle this problem. Elspeth Bruford, who runs HGNC, is now part of the core TGMI team and together HGNC and TGMI are developing a strategy to bring stability to gene symbols.

This is more complex than it might at first appear. It includes activities such as responsibly and transparently fast-tracking any necessary changes in gene symbols, for example assigning permanent symbols to genes that currently have temporary symbols. And no doubt we will need to persuade some people that they should give up their attachment to non-approved symbols, for the greater good!

Then we need to fix the approved, permanent gene symbols so that any future changes will only occur in exceptional circumstances.

We are first focusing on the thousands of genes that are already being tested in clinical practice. By the end of the programme we hope to have completed the task (or at least to have a clear completion road-map) for all the ~20,000 genes.


3 thoughts on “Stabilising gene symbols”

  • Raymond Dalgleish

    The article clearly articulates the need for stable gene symbols and I agree wholeheartedly with that argument. However, the “Practicality is essential” section promotes the dangerous and mistaken impression that a gene symbol (BRIP1 in this case) is a valid substitute for a reference sequence accession and version number when reporting a sequence variant. It turns out in the case of the BRIP1 gene that there is only the one reference transcript: NM_032043.2. Hence, the variant should be reported as NM_032043.2:c.2392C>T. Alternatively, it could also be validly reported at the genome level for GRCh38 as NC_000017.11:g.61716051G>A.

    Most genes have more than a single reference transcript, hence it’s essential to always specify the transcript against which the variant is reported. As an example, the ATG13 gene has 51 mRNA transcripts plus 3 non-coding RNA transcripts. Alternatively, a variant may be validly described in the context of a genome sequence but, again, the actual reference sequence must be precisely specified by giving the accession and version.

    Another issue is that the protein-level consequence of the BRIP1 sequence variant should be reported as p.Arg798Ter or as p.Arg798*. The IUPAC single-letter code X does not correspond to a translation stop. It has had different meanings over the years, but the current meaning of X is “any amino acid”. The valid symbols in the HGVS variant description recommendations for a translation stop are Ter and *.

    Given the professed aims of TGMI, I’m alarmed at the lack of standards compliance exemplified in this blog entry.

  • Nazneen Rahman Post author

    Dear Raymond, Thanks for taking the time to comment. Our blogs are short posts that focus on one specific issue – the need for stabilising gene symbols in this instance. But rest assured we are fully aware of and engaged in the other aspects that you mention, some of which feature in other posts and also some of our publications. We are particularly focusing on the issue of reference transcripts currently, in close collaboration with the genomics and clinical communities. Given your knowledge and passion for the area, it would be great to get your thoughts on it. I’ll drop you an email about it.
