Variant Updates (updated 14 August 2019)

The COSMIC database is undergoing an extensive update and reannotation to ensure standardisation and modernisation across COSMIC data. This will substantially improve the identification of unique variants that may have been described at the genome, transcript and/or protein level. The introduction of a Genomic Identifier, along with complete annotation across multiple high-quality Ensembl transcripts and improved compliance with current HGVS syntax, will enable variant matching both within COSMIC and across other bioinformatic datasets.

As a result of these updates there will be significant changes in the upcoming releases as we work through this process. The first stage of this work was the introduction of improved HGVS syntax compliance in our May release and is detailed below. The majority of the changes will be reflected in COSMIC v90, which will be released in late August or early September, and the remaining changes will be introduced over the next few releases.

Changes in v90

The significant changes in v90 include:

Key points

COSMIC variants have been annotated on all relevant Ensembl transcripts across both the GRCh37 and GRCh38 assemblies from Ensembl release 93. New genomic identifiers (e.g. COSV56056643) are used. These refer to the variant change at the genomic level rather than at the gene, transcript or protein level, and can thus be used universally. Existing COSM IDs will continue to be supported and will now be referred to as legacy identifiers (e.g. COSM476). The legacy identifiers (COSM) are still searchable. Mutations without genomic coordinates have no COSV identifier, so COSM identifiers will continue to be used for them.
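As a rough illustration of how the two identifier types might be told apart when processing COSMIC data, the sketch below classifies an identifier by its prefix. It assumes that identifiers consist of the prefix COSV or COSM followed by digits, as in the examples above. The function name classify_cosmic_id is hypothetical and not part of any COSMIC tool.

```python
import re

# Assumption: identifiers are "COSV" or "COSM" followed by digits,
# as in the examples COSV56056643 and COSM476.
_ID_PATTERN = re.compile(r"^(COSV|COSM)(\d+)$")


def classify_cosmic_id(identifier):
    """Return 'genomic' for COSV IDs, 'legacy' for COSM IDs, else None."""
    match = _ID_PATTERN.match(identifier.strip())
    if match is None:
        return None
    return "genomic" if match.group(1) == "COSV" else "legacy"


print(classify_cosmic_id("COSV56056643"))  # genomic
print(classify_cosmic_id("COSM476"))       # legacy
```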

All relevant Ensembl transcripts in COSMIC (which have been selected based on Ensembl canonical classification and on the quality of the dataset to include only GENCODE basic transcripts) will now have both accession and version numbers, so that the exact transcript is known, ensuring reproducibility. This also provides transparency and clarity as the data are updated.
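For anyone integrating these transcripts into their own pipeline, a minimal sketch of separating the accession from the version number is given below. It assumes the two are written together in the usual Ensembl "accession.version" form; whether a given download file stores them this way or in separate columns should be checked against the file descriptions. The transcript ID used here is a made-up placeholder.

```python
def split_transcript_id(versioned_id):
    """Split 'ACCESSION.VERSION' into (accession, version as int)."""
    accession, sep, version = versioned_id.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not a versioned transcript ID: {versioned_id!r}")
    return accession, int(version)


# The ID below is a made-up placeholder, not a real COSMIC record.
print(split_transcript_id("ENST00000123456.7"))  # ('ENST00000123456', 7)
```

Keeping the version as a separate integer makes it straightforward to detect when the same accession appears with a different version in a later release.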

How these changes will be reflected in the download files

As we are now mapping all variants on all relevant Ensembl transcripts, the number of rows in the majority of variant download files has increased significantly. In the download files, additional columns are provided including the legacy identifier (COSM) and the new genomic identifier (COSV). An internal mutation identifier is also provided to uniquely represent each mutation, on a specific transcript, on a given assembly build. The accession and version number for each transcript are included. File descriptions listing and explaining the new columns are now available for each of the download files. They will also be available from the downloads page after the v90 release.
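Because one genomic variant can now appear on several rows (one per transcript), it may help to group rows by genomic identifier when reading a download file. The sketch below shows one way to do this on a small in-memory, tab-separated sample. The column names (GENOMIC_ID, LEGACY_ID, INTERNAL_ID, TRANSCRIPT) and all values are invented placeholders; the real column names are listed in the file descriptions.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; column names and values are placeholders only.
SAMPLE_TSV = (
    "GENOMIC_ID\tLEGACY_ID\tINTERNAL_ID\tTRANSCRIPT\n"
    "COSV0000001\tCOSM0001\t101\tENST00000000001.1\n"
    "COSV0000001\tCOSM0001\t102\tENST00000000002.3\n"
    "COSV0000002\tCOSM0002\t103\tENST00000000003.2\n"
)


def transcripts_per_variant(handle):
    """Map each genomic ID to the list of transcripts it is annotated on."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        grouped[row["GENOMIC_ID"]].append(row["TRANSCRIPT"])
    return dict(grouped)


result = transcripts_per_variant(io.StringIO(SAMPLE_TSV))
for genomic_id, transcripts in result.items():
    print(genomic_id, len(transcripts), transcripts)
```

Counting distinct genomic identifiers rather than rows avoids overcounting variants that are annotated on more than one transcript.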

Taster files

We have prepared files that contain a sample of the complete dataset for each of the COSMIC download files. This gives you real data in the final v90 file format to manipulate and integrate ahead of the release.

File descriptions: file_descriptions_v90.xlsx

HGVS syntax update (COSMIC release 88)

In this release (v88) we have updated the HGVS nomenclature for many of the manually curated mutations that were published without CDS/genomic information (c.? mutations). Details of how these updated syntaxes are reflected in the data are given below.

Use of X in place of N to indicate unknown amino acid

In line with HGVS standards, we now use X rather than N to indicate an unknown amino acid. Many of the more recently curated mutations still retain the NNN... notation. These will be updated to the XXX... notation in a future release.

Frameshift mutations

Most manually curated frameshift mutations with unknown CDS change (c.?) now include the first mutant amino acid in the syntax. For example:

c.? / p.C1396Lfs*5
c.? / p.V1833Afs*?
c.? / p.S1303Xfs*58
c.? / p.P463Xfs*?

Frameshift mutations with known genomic/CDS details have not yet been updated and therefore retain the original syntax, for example c.355_356insATGG / p.E121fs*5.
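Both the updated and the original frameshift styles shown above can be read with a single pattern. The sketch below is one possible way to do so, based only on the examples in this section; it is not an HGVS-complete parser, and the function and field names are hypothetical.

```python
import re

# Matches e.g. p.C1396Lfs*5, p.P463Xfs*? and the original style p.E121fs*5.
# The first mutant amino acid is optional so that both styles are accepted.
_FS_PATTERN = re.compile(r"^p\.([A-Z])(\d+)([A-Z])?fs\*(\d+|\?)$")


def parse_frameshift(aa_syntax):
    """Return the parts of a frameshift description, or None if no match."""
    match = _FS_PATTERN.match(aa_syntax)
    if match is None:
        return None
    ref, pos, alt, term = match.groups()
    return {
        "ref": ref,
        "position": int(pos),
        "first_mutant": alt,  # None when the original style is used
        "stop_offset": None if term == "?" else int(term),
    }


for example in ["p.C1396Lfs*5", "p.V1833Afs*?", "p.S1303Xfs*58", "p.E121fs*5"]:
    print(example, parse_frameshift(example))
```

A first_mutant value of None identifies entries that still use the original syntax, and X marks a first mutant amino acid that is unknown.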

Unknown substitutions and insertions

Most missense substitution mutations with no reported CDS change now have the syntax style p.H1904X.

Unknown mutations remain p.H1904?.

Most unknown insertion mutations now have the syntax p.F12_S13insXXX.
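The three syntax styles above differ only in their final characters, so they can be distinguished with simple patterns. The following sketch is built solely from the examples in this section; the category labels and the function name are hypothetical.

```python
import re

# Patterns derived from the examples p.H1904X, p.H1904? and p.F12_S13insXXX.
_PATTERNS = [
    ("substitution_unknown_aa", re.compile(r"^p\.[A-Z]\d+X$")),
    ("unknown_mutation", re.compile(r"^p\.[A-Z]\d+\?$")),
    ("insertion_unknown_aa", re.compile(r"^p\.[A-Z]\d+_[A-Z]\d+insX+$")),
]


def classify_unknown_syntax(aa_syntax):
    """Return a label for the matching syntax style, or None."""
    for label, pattern in _PATTERNS:
        if pattern.match(aa_syntax):
            return label
    return None


print(classify_unknown_syntax("p.H1904X"))         # substitution_unknown_aa
print(classify_unknown_syntax("p.H1904?"))         # unknown_mutation
print(classify_unknown_syntax("p.F12_S13insXXX"))  # insertion_unknown_aa
```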

Whole gene deletions

Many manually curated whole gene deletions have been updated to the syntax c.1_*del / p.0. A handful remain in the old style c.1_3267del / p.0?.
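Since both styles are currently present in the data, a consumer may want to recognise either one. The sketch below checks a CDS / AA pair against the two forms quoted above; it assumes the old style always runs from position 1 to a numeric end position and is paired with p.0?, which is inferred from the single example given here. The function name is hypothetical.

```python
import re

# Old style inferred from the example c.1_3267del / p.0?
_OLD_CDS = re.compile(r"^c\.1_\d+del$")


def whole_gene_deletion_style(cds_syntax, aa_syntax):
    """Return 'updated', 'old', or None for a CDS / AA syntax pair."""
    if cds_syntax == "c.1_*del" and aa_syntax == "p.0":
        return "updated"
    if _OLD_CDS.match(cds_syntax) and aa_syntax == "p.0?":
        return "old"
    return None


print(whole_gene_deletion_style("c.1_*del", "p.0"))      # updated
print(whole_gene_deletion_style("c.1_3267del", "p.0?"))  # old
```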

fs*1 / nonsense mutations

Most manually curated mutations that had an fs*1 syntax (for example p.S123fs*1) have been updated to a substitution nonsense syntax (e.g. p.S123*), with the AA mutation type changed to match, in keeping with HGVS recommendations. This also applies to a few with known CDS information. However, most with genomic information have not yet been updated and retain the old syntax style, e.g. c.368_369insT / p.N124fs*1.
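When comparing entries that have been updated with entries that have not, it can be useful to normalise the old fs*1 form to the nonsense form. The sketch below performs that string rewrite for the pattern shown above. It is an illustration only and does not represent how COSMIC carries out the update; the function name is hypothetical.

```python
import re

# Matches only fs*1, e.g. p.S123fs*1; fs*10 or fs*? are left untouched.
_FS1_PATTERN = re.compile(r"^p\.([A-Z])(\d+)fs\*1$")


def fs1_to_nonsense(aa_syntax):
    """Rewrite p.<AA><pos>fs*1 as p.<AA><pos>*; return other input unchanged."""
    match = _FS1_PATTERN.match(aa_syntax)
    if match is None:
        return aa_syntax
    ref, pos = match.groups()
    return f"p.{ref}{pos}*"


print(fs1_to_nonsense("p.S123fs*1"))  # p.S123*
print(fs1_to_nonsense("p.N124fs*1"))  # p.N124*
print(fs1_to_nonsense("p.E121fs*5"))  # p.E121fs*5 (unchanged)
```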