Variant Updates (updated 14 August 2019)
The COSMIC database is undergoing an extensive update and reannotation, in order to ensure standardisation and modernisation across COSMIC data. This will substantially improve the identification of unique variants that may have been described at the genome, transcript and/or protein level. The introduction of a Genomic Identifier, along with complete annotation across multiple, high quality Ensembl transcripts and improved compliance with current HGVS syntax, will enable variant matching both within COSMIC and across other bioinformatic datasets.
As a result of these updates there will be significant changes in the upcoming releases as we work through this process. The first stage of this work was the introduction of improved HGVS syntax compliance in our May release and is detailed below. The majority of the changes will be reflected in COSMIC v90, which will be released in late August or early September, and the remaining changes will be introduced over the next few releases.
Changes in v90
The significant changes in v90 include:
- Updated genes, transcripts and proteins from Ensembl release 93 on both the GRCh37 and GRCh38 assemblies.
- Full reannotation of COSMIC variants with known genomic coordinates using Ensembl's Variant Effect Predictor (VEP). This provides accurate and standardised annotation uniformly across all relevant transcripts and genes that include the genomic location of the variant.
- New stable genomic identifiers (COSV) that indicate the definitive position of the variant on the genome. These unique identifiers allow variants to be mapped between GRCh37 and GRCh38 assemblies and displayed on a selection of transcripts.
- Updated cross-reference links between COSMIC genes and other widely-used databases such as HGNC, RefSeq, UniProt and CCDS.
- Complete standardised representation of COSMIC variants, following the most recent HGVS recommendations, where possible.
- Remapping of gene fusions on the updated transcripts on both the GRCh37 and GRCh38 assemblies, along with the genomic coordinates for the breakpoint positions.
- Reduced redundancy of mutations. Duplicate variants have been merged into one representative variant.
COSMIC variants have been annotated on all relevant Ensembl transcripts across both the GRCh37 and GRCh38 assemblies from Ensembl release 93. New genomic identifiers (e.g. COSV56056643) are used; these refer to the variant at the genomic level rather than at the gene, transcript or protein level, and can thus be used universally. Existing COSM IDs (e.g. COSM476) will continue to be supported and remain searchable, and will now be referred to as legacy identifiers. Mutations without genomic coordinates, and hence without a COSV identifier, will continue to use COSM identifiers.
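As a rough illustration of how the two identifier types might be told apart in downstream scripts, the sketch below classifies an ID by its prefix and prefers the genomic identifier when one exists. The pattern (prefix followed by digits) is inferred only from the two examples given here, and the function names are hypothetical rather than part of any COSMIC tool.

```python
import re
from typing import Optional

# Assumption: IDs look like the examples in the text (COSV56056643, COSM476),
# i.e. a COSV or COSM prefix followed by digits.
_COSMIC_ID = re.compile(r"^(COSV|COSM)(\d+)$")


def classify_cosmic_id(identifier: str) -> str:
    """Return 'genomic' for COSV IDs, 'legacy' for COSM IDs, else 'unknown'."""
    match = _COSMIC_ID.match(identifier.strip())
    if match is None:
        return "unknown"
    return "genomic" if match.group(1) == "COSV" else "legacy"


def preferred_id(cosv: Optional[str], cosm: Optional[str]) -> Optional[str]:
    """Prefer the COSV ID; fall back to COSM for mutations without one."""
    return cosv or cosm


print(classify_cosmic_id("COSV56056643"))
print(classify_cosmic_id("COSM476"))
```

The fallback in `preferred_id` mirrors the rule that mutations without genomic coordinates keep their COSM identifier.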
All relevant Ensembl transcripts in COSMIC (which have been selected based on Ensembl canonical classification and on the quality of the dataset to include only GENCODE basic transcripts) will now have both accession and version numbers, so that the exact transcript is known, ensuring reproducibility. This also provides transparency and clarity as the data are updated.
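A versioned transcript identifier can be split into its accession and version parts before comparing records across releases. The following is a minimal sketch, assuming the usual Ensembl `accession.version` form; the transcript ID used is a made-up placeholder, and the helper name is hypothetical.

```python
from typing import NamedTuple


class TranscriptId(NamedTuple):
    accession: str
    version: int


def split_transcript_id(versioned_id: str) -> TranscriptId:
    """Split 'ACCESSION.VERSION' at the last dot; reject unversioned IDs."""
    accession, sep, version = versioned_id.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not a versioned transcript ID: {versioned_id!r}")
    return TranscriptId(accession, int(version))


# Placeholder ID for illustration only, not a real COSMIC record.
tid = split_transcript_id("ENST00000000001.2")
print(tid.accession, tid.version)
```

Rejecting unversioned IDs outright, rather than defaulting the version, keeps the reproducibility guarantee explicit: two records match only if both accession and version agree.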
How these changes will be reflected in the download files
As we are now mapping all variants on all relevant Ensembl transcripts, the number of rows in the majority of variant download files has increased significantly. In the download files, additional columns are provided including the legacy identifier (COSM) and the new genomic identifier (COSV). An internal mutation identifier is also provided to uniquely represent each mutation, on a specific transcript, on a given assembly build. The accession and version number for each transcript are included. File descriptions listing and explaining the new columns are now available for each of the download files. They will also be available from the downloads page after the v90 release.
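Because each variant now appears once per relevant transcript, scripts that previously assumed one row per mutation may need to group rows. The sketch below shows one way to do this. The column names (`GENOMIC_ID`, `LEGACY_ID`, `MUTATION_ID`, `TRANSCRIPT`), the tab-separated layout and all row values are assumptions made for illustration; consult the file descriptions for the actual column names.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; column names and values are hypothetical.
SAMPLE = (
    "GENOMIC_ID\tLEGACY_ID\tMUTATION_ID\tTRANSCRIPT\n"
    "COSV1\tCOSM1\t101\tENST00000000001.2\n"
    "COSV1\tCOSM1\t102\tENST00000000002.1\n"
    "\tCOSM2\t103\tENST00000000003.4\n"
)


def group_by_variant(handle):
    """Map each variant (COSV, or COSM if no COSV) to its transcripts."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = row["GENOMIC_ID"] or row["LEGACY_ID"]
        groups[key].append(row["TRANSCRIPT"])
    return dict(groups)


groups = group_by_variant(io.StringIO(SAMPLE))
for variant, transcripts in groups.items():
    print(variant, len(transcripts))
```

Here the internal mutation identifier distinguishes the two `COSV1` rows, while the grouping key collapses them back into a single genomic variant.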
We have prepared files that contain a sample of the complete dataset for each of the COSMIC download files. This gives you real data in the final v90 file format to manipulate and integrate ahead of the release.
|COSMIC||tar zip||tar zip|
|Cell lines||tar zip||tar zip|
HGVS syntax update (COSMIC release 88)
In this release (v88) we have updated the HGVS nomenclature for many of the manually curated mutations that were published without CDS/genomic details (c.? mutations). Details of how these updated syntaxes are reflected in the data are given below.
Use of X in place of N to indicate unknown amino acid
We are now using X to indicate an unknown amino acid instead of N, as per HGVS standards. Many of the more recently curated mutations retain the NNN... notation. These will be updated to the XXX... notation in a future release.
Most manually curated frameshift mutations with unknown CDS change (c.?) now include the first mutant amino acid in the syntax, for example:
- c.? / p.C1396Lfs*5
- c.? / p.V1833Afs*?
- c.? / p.S1303Xfs*58
- c.? / p.P463Xfs*?
Frameshift mutations with known genomic/CDS details have not yet been updated and therefore retain the original syntax, for example c.355_356insATGG / p.E121fs*5.
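Since both the updated and the original frameshift styles will coexist for now, a parser needs to accept either. The sketch below is one possible approach, covering only the forms shown in the examples above (single-letter amino acids, an optional first mutant amino acid, and a stop distance that may be `?`). It is not a full HGVS parser, and the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches e.g. p.C1396Lfs*5, p.V1833Afs*?, p.S1303Xfs*58 and old-style p.E121fs*5.
_FRAMESHIFT = re.compile(r"^p\.([A-Z])(\d+)([A-Z])?fs\*(\d+|\?)$")


@dataclass
class Frameshift:
    ref_aa: str                   # reference amino acid
    position: int                 # protein position
    alt_aa: Optional[str]         # first mutant amino acid; None in old style
    stop_distance: Optional[int]  # None when reported as '?'


def parse_frameshift(protein_syntax: str) -> Optional[Frameshift]:
    """Parse a protein-level frameshift; return None if it does not match."""
    match = _FRAMESHIFT.match(protein_syntax.strip())
    if match is None:
        return None
    ref_aa, position, alt_aa, stop = match.groups()
    stop_distance = None if stop == "?" else int(stop)
    return Frameshift(ref_aa, int(position), alt_aa, stop_distance)


print(parse_frameshift("p.C1396Lfs*5"))
print(parse_frameshift("p.E121fs*5"))
```

An `alt_aa` of `X` indicates an unknown first mutant amino acid, while `None` indicates the original syntax in which it is not recorded at all.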
Unknown substitution and insertions
Most missense substitution mutations with no reported CDS change now have
the syntax style
Unknown mutations remain
Most unknown insertion mutations now have the syntax
Whole gene deletions
Many manually curated whole gene deletions have been updated to the syntax c.1_*del / p.0. A handful remain in the old style, e.g. c.1_3267del / p.0?.
fs*1 / nonsense mutations
Most manually curated mutations which had an fs*1 syntax (e.g.
p.S123fs*1) have been updated to a
substitution nonsense syntax (e.g.
p.S123*) and AA mutation
type, in keeping with HGVS recommendations. This also applies to a few
with known CDS information. However, most with genomic information have not
yet been updated and retain the old syntax style, e.g.