V90 Release Changes (5 September 2019)

The v90 release includes important structural updates and a complete reannotation of the COSMIC data. This will help us ensure standardisation and modernisation across COSMIC in the future, substantially improving the ease of identification of unique variants that may have been described at the genome, transcript and/or protein level.

With this release, COSMIC's definition of a mutation has been updated. A parent-child relationship has been introduced between the variant as described at the genomic level (the parent) and the variant as described by the annotation on one or more transcripts (the child or children). The introduction of this new relationship necessitates the introduction of a new genomic mutation identifier, COSV, for the parent variant.

In concert with the underlying structural changes there has also been a complete reannotation across multiple, high quality Ensembl transcripts. Variant syntaxes now have improved compliance with current HGVS syntax. Taken together, these changes will improve variant matching both within COSMIC and across other bioinformatic datasets.

This page is the central point for all information regarding these updates. Below we outline the changes to the data along with the downstream effects to the download files and the website.

Data Updates

1. Updated genes, transcripts and proteins from Ensembl release 93 independently on both the GRCh37 and GRCh38 assemblies.

To ensure standardisation and modernisation across COSMIC, we have brought all our data on to genes, transcripts and proteins from Ensembl release 93. This has been done independently for both the GRCh37 and GRCh38 assemblies and includes up-to-date gene symbols and updated genomic coordinates for GRCh37.p13 and GRCh38.p12. We have also provided the full coordinates of transcripts (as opposed to coordinates corresponding to the CDS region only).

All Ensembl transcripts in COSMIC, which have been selected based on the quality of the dataset to include only GENCODE basic, will now have both accession and version numbers. Limiting the annotation of mutations to the high-quality GENCODE-basic transcripts for each gene ensures accurate variant annotation on the UTR regions of the gene, providing transparency and clarity as the data are updated in the future. Variants are shown on both the canonical transcript, as determined by Ensembl, and on alternative GENCODE basic transcripts.

In some cases due to differences between the Ensembl GRCh37 and GRCh38 gene builds, certain transcripts have changed significantly. This means that the variant annotation on GRCh38 correctly results in mutation syntaxes that are different to those on GRCh37.


2. There has been a full reannotation of COSMIC variants with known genomic coordinates using Ensembl's Variant Effect Predictor (VEP). This provides accurate and standardised annotation of variants with known genomic location uniformly across all relevant transcripts and genes.

COSMIC's in-house annotation suite has been replaced by Ensembl VEP. In order to ensure consistency, Ensembl release 93 has been used across both data and software. VEP annotates a given variant on all transcripts that overlap a given location, resulting in each variant being annotated on all transcripts. COSMIC will include the variant only on relevant protein-coding transcripts (GENCODE basic only) that overlap that genomic region.

There is now a more accurate representation of mutations that lie within intronic and UTR regions of genes. Previously, the majority of intronic mutations that existed more than 10 bp outside of a splice-site and those lying within a UTR were represented as non-coding variants. These variants had a COSN identifier. Such variants have now been annotated in the coding domain by VEP and have HGVS-compliant syntaxes and a genomic mutation identifier (COSV). A detailed description of all the identifiers and how these have changed is given in section 3.

Coding and non-coding variants in COSMIC are now defined as follows:

Coding mutations
Mutations that lie within the gene boundary of a protein-coding transcript from Ensembl. These are identified with a genomic identifier (COSV), a legacy mutation identifier (COSM or COSN), and an alternative mutation identifier.
Non-coding mutations
Mutations that lie within an intergenic region of the genome. These are identified with a genomic identifier (COSV) and a legacy mutation identifier (COSM or COSN).

In some specific cases where mutations exist in the non-coding region, they have been manually curated as coding mutations and have been given a COSM identifier in previous releases. This includes, for example, the TERT promoter mutations, which have notations such as c.1-124C>A. These have temporarily been removed from COSMIC, since VEP cannot annotate coding mutations outside the coding region. We are working on a solution to bring these mutations back into COSMIC in a future release. However, in v97 we have provided the file NCV_CDS_syntax_mapping.tsv to help map TERT promoter mutation identifiers with curated CDS syntaxes. This file contains the information needed to map these CDS syntaxes to mutations described in the CosmicNonCodingVariants.vcf.gz or CosmicNCV.tsv.gz download files.


3. The new stable genomic mutation identifiers (COSV) indicate the definitive position of the variant on the genome. These unique identifiers allow variants to be mapped between GRCh37 and GRCh38 assemblies and displayed on a selection of transcripts.

Definitions of the new COSMIC identifiers

Genomic Mutation Identifier

The genomic mutation identifier (COSV) indicates the definitive position of the variant on the genome. This identifier is trackable and stable between different versions of the release and remains the same between different assemblies (GRCh37 and GRCh38). The genomic mutation identifier is now the preferred way to identify mutations. Note that mutations with no known genomic coordinates will not have a value for this identifier, and may be identified using the legacy mutation identifier (see below). The genomic mutation identifier is referred to as GENOMIC_MUTATION_ID in the download files.

Legacy Mutation Identifier

The existing COSM and COSN mutation identifiers are now referred to as legacy mutation identifiers, retaining their COSM/COSN nomenclature. These identifiers remain the same between different assemblies (GRCh37 and GRCh38). The legacy mutation identifier is referred to as LEGACY_MUTATION_ID in the download files.

Previously, each mutation at a specific genomic coordinate but on a different transcript had a unique COSM identifier. Now, all COSM identifiers at the same genomic location have been collapsed into one representative COSM identifier. All previous COSM identifiers are being maintained in order to enable tracking of existing mutations. Mutations can be found on the website using legacy mutation identifiers, either by entering the full identifier in the search bar, e.g. COSM476, or by entering the URL directly, inserting the numerical part of the COSM, e.g. https://cancer.sanger.ac.uk/cosmic/mutation/overview?id=476.

Similarly, COSN mutations may be found by searching using the legacy mutation identifier, e.g. COSN9832680, or by inserting the numerical part of the variant's corresponding COSN into the URL, e.g. https://cancer.sanger.ac.uk/cosmic/ncv/overview?id=71991075. Please note the addition of ncv in the URL.
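The two URL patterns above can be captured in a small helper. This is only a sketch: the function name legacy_id_to_url is hypothetical, and the URL templates are copied from the examples in the text rather than from any documented API.

```python
import re

# URL templates copied from the examples in the text (assumed to be stable).
COSM_URL = "https://cancer.sanger.ac.uk/cosmic/mutation/overview?id={}"
COSN_URL = "https://cancer.sanger.ac.uk/cosmic/ncv/overview?id={}"


def legacy_id_to_url(legacy_id: str) -> str:
    """Return the overview-page URL for a COSM or COSN legacy identifier."""
    match = re.fullmatch(r"(COSM|COSN)(\d+)", legacy_id.strip())
    if match is None:
        raise ValueError(f"not a legacy mutation identifier: {legacy_id!r}")
    prefix, number = match.groups()
    # COSN pages live under /ncv/, COSM pages under /mutation/.
    template = COSM_URL if prefix == "COSM" else COSN_URL
    return template.format(number)


print(legacy_id_to_url("COSM476"))
```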

Alternative Mutation Identifiers

These are internal identifiers that are unique to a mutation on a particular transcript and are displayed in the URL of the mutation pages. Therefore, several of these alternative mutation IDs could be associated with a single genomic mutation identifier (COSV), in cases where the mutation has been mapped to all overlapping genes and transcripts. The alternative mutation identifier is referred to as MUTATION_ID in the download files.

Similarly, since every variant with known genomic coordinates has been assigned a genomic mutation identifier, each legacy mutation identifier (COSM) can also be associated with several alternative mutation identifiers. These internal IDs are expected to change between assemblies (GRCh37 and GRCh38), since each assembly build has its own set of genes and transcripts, and between releases.

For example, the NRAS mutation COSV54736310 (COSM580) on transcript ENST00000369535 has the alternative mutation ID 32596173 on GRCh37, while on GRCh38 it has the alternative mutation ID 97107326.

COSMIC's definition of a mutation has been updated to introduce a parent-child relationship between the variant as described at the genomic level (the parent) and the variant as described by the annotation on one or more transcripts (the child or children). The parent is identified with a new genomic mutation identifier (COSV) and the child mutations are all identified by the same legacy mutation identifier (COSM). Each child mutation is assigned a unique alternative mutation ID.

This is illustrated in the diagram below. Here the parent BRAF mutation is identified as COSV56056643. The child mutations on the canonical transcripts for the GRCh37 and GRCh38 assemblies are shown in green. Child mutations on alternative transcripts are shown in blue. All children share the legacy COSM476 identifier, but each has a unique alternative mutation ID. The different downstream consequences of the same parent mutation can be seen in the different CDS and protein sequences of the children.
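For readers who prefer code to diagrams, the parent-child model can be sketched as a small data structure. The class and field names below are hypothetical and are not COSMIC's own schema; the identifier values are the NRAS ones quoted earlier in this section.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ChildMutation:
    """A variant as annotated on one transcript of one assembly."""
    alternative_id: int   # MUTATION_ID in the download files
    transcript: str       # Ensembl transcript accession
    assembly: str         # "GRCh37" or "GRCh38"


@dataclass
class GenomicMutation:
    """The parent: a variant described at the genomic level."""
    cosv: str             # GENOMIC_MUTATION_ID, same on both assemblies
    legacy_id: str        # LEGACY_MUTATION_ID, shared by all children
    children: List[ChildMutation] = field(default_factory=list)

    def alternative_ids(self, assembly: str) -> List[int]:
        """Alternative mutation IDs of the children on one assembly."""
        return [c.alternative_id for c in self.children if c.assembly == assembly]


nras = GenomicMutation("COSV54736310", "COSM580")
nras.children.append(ChildMutation(32596173, "ENST00000369535", "GRCh37"))
nras.children.append(ChildMutation(97107326, "ENST00000369535", "GRCh38"))
print(nras.alternative_ids("GRCh38"))
```

The point of the sketch is that the COSV and COSM values sit on the parent, while each child carries its own alternative mutation ID tied to a transcript and an assembly.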


4. The cross-references between COSMIC genes and other widely-used databases, such as HGNC, RefSeq, UniProt and CCDS, have been updated.


5. Complete, standardised representation of COSMIC variants, following the most recent HGVS recommendations, where possible.

In order to standardise the representation of COSMIC variants, where possible the syntax has been updated to match the current HGVS recommendations. This has generally been done using VEP. However, where mutations don't have clear CDS/genomic information (usually c.? mutations), these have been updated manually. A summary of the changes applied to the manually curated mutations is given below.


6. The gene fusions have been remapped on the updated transcripts on both the GRCh37 and GRCh38 assemblies, along with the genomic coordinates for the breakpoint positions.

COSMIC gene fusions have been mapped to the new, updated Ensembl transcripts. Consequently, some breakpoint coordinates have changed, if 5' or 3' UTR lengths are different in the new transcripts. Breakpoints continue to be recorded using cDNA coordinates but we also provide genomic coordinates.

As elsewhere we have updated the HGVS syntax and follow the HGVS nomenclature rules where appropriate. For example:

Previous syntax:

EML4{ENST00000318522}:r.1_1751_ALK{ENST00000389048}:r.4080_6220

New style syntax, including accession with version number, as well as gene symbol:

ENST00000318522.5(EML4):r.1_1751_ENST00000389048.3(ALK):r.4080_6220

Previous syntax where o indicates sequence in reverse orientation:

FUS{ENST00000254108}:r.1_619_oCREB3L2{ENST00000330387}:r.936-18_948_CREB3L2{ENST00000330387}:r.1006_7412

New style syntax where inv indicates sequence in reverse orientation:

ENST00000254108.7(FUS):r.1_619_ENST00000330387.6(CREB3L2):r.936-18_948inv_ENST00000330387.6(CREB3L2):r.1006_7412
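The new-style syntax is regular enough to be split into its segments with a regular expression. The following is a sketch that has only been checked against the two examples shown above; the pattern and the function name parse_fusion are assumptions, not an official COSMIC grammar, and other fusion syntaxes may need a more permissive pattern.

```python
import re

SEGMENT = re.compile(
    r"(?P<accession>ENST\d+)\.(?P<version>\d+)"  # transcript accession.version
    r"\((?P<gene>[^)]+)\)"                       # gene symbol in parentheses
    r":r\.(?P<start>[^_]+)_(?P<end>[^_]+?)"      # RNA start and end positions
    r"(?P<inverted>inv)?"                        # optional reverse-orientation flag
    r"(?=_ENST|$)"                               # next segment or end of string
)


def parse_fusion(syntax: str) -> list:
    """Split a new-style fusion syntax into one dict per transcript segment."""
    segments = []
    for match in SEGMENT.finditer(syntax):
        segment = match.groupdict()
        segment["inverted"] = segment["inverted"] is not None
        segments.append(segment)
    return segments


fus = ("ENST00000254108.7(FUS):r.1_619_"
       "ENST00000330387.6(CREB3L2):r.936-18_948inv_"
       "ENST00000330387.6(CREB3L2):r.1006_7412")
for segment in parse_fusion(fus):
    print(segment["gene"], segment["start"], segment["end"], segment["inverted"])
```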

We have standardised the fusion translocation names (syntaxes) and merged variants where the Inferred Breakpoint and the Observed mRNA result in the same syntax. These will now share the same COSMIC fusion identifier (COSF). For example:

Previously: Inferred Breakpoint COSF463:

EML4{ENST00000318522}:r.1_1751_ALK{ENST00000389048}:r.4080_6220

And observed mRNA COSF408:

EML4{ENST00000318522}:r.1_1751_ALK{ENST00000389048}:r.4080_6220

Now: Inferred breakpoint and observed mRNA COSF408

ENST00000318522.5(EML4):r.1_1751_ENST00000389048.3(ALK):r.4080_6220


7. Where duplicate variants exist at a single genomic location they have been merged into one representative variant.

In order to reduce redundancy and help with the accuracy of the counts, duplicate variants have been collapsed into a single representative variant. This means that some COSM and COSN IDs no longer exist. The representative mutation is present in COSMIC and the retired COSM and COSN IDs are still searchable on the website. When searching for a retired COSM the website seamlessly redirects to the newly merged identifier. The mutation overview page will display the following message if the mutation has been merged:

"The legacy mutation COSM5846084 has now been merged into the following mutation."

Examples of this are COSM5846084 and COSM5846086, both of which have been merged into COSM5846085.

The retired COSM and COSN IDs are not currently available in the download files, but the mapping between retired IDs and the representative legacy ID will be made available when possible.


8. The sample and mutation counts will be different.

The definition of a mutation within COSMIC has fundamentally changed with the introduction of the parent-child model and the new genomic mutation identifier. A combination of the annotation on updated transcripts, annotation of variants on multiple transcripts and de-duplication of variants has affected the sample and mutation counts.


9. SNP filtering is now in place for the Cell Lines Project (CLP) data.

The CLP data has now undergone SNP filtering, similar to the COSMIC data but without the usual exceptions such as classic genes, curated papers, etc, since CLP includes only whole genome sequencing (WGS) data. This will mean that some mutations which were visible on the website previously will now be absent, but these variants will still be available in the download files.

Changes to the Download Files

This update brings significant changes to the structure and quantity of COSMIC data, as all variants are now mapped on all protein-coding Ensembl transcripts. The download files include the legacy mutation identifier (COSM), the new genomic mutation identifier (COSV), and in many cases an alternative mutation identifier that uniquely represents a specific mutation on a specific transcript on a given assembly build. Accession and version numbers are included for each transcript. Descriptions for each of the download files are available as an Excel document and from the downloads page.

Additional columns in tab-separated format (TSV) files

File
    New column              Description

CosmicCompleteTargetedScreensMutantExport.tsv
    GENOMIC_MUTATION_ID     Genomic mutation identifier (COSV)
    LEGACY_MUTATION_ID      Legacy mutation identifier (COSM)
    MUTATION_ID             Alternative mutation identifier
CosmicGenomeScreensMutantExport.tsv
    MUTATION_ID             Alternative mutation identifier
    GENOMIC_MUTATION_ID     Genomic mutation identifier (COSV)
    LEGACY_MUTATION_ID      Legacy mutation identifier (COSM)
CosmicMutantExport.tsv
    GENOMIC_MUTATION_ID     Genomic mutation identifier (COSV)
    LEGACY_MUTATION_ID      Legacy mutation identifier (COSM)
    MUTATION_ID             Alternative mutation identifier
CosmicMutantExportCensus.tsv
    GENOMIC_MUTATION_ID     Genomic mutation identifier (COSV)
    LEGACY_MUTATION_ID      Legacy mutation identifier (COSM)
    MUTATION_ID             Alternative mutation identifier
CosmicNCV.tsv
    GENOMIC_MUTATION_ID     Genomic mutation identifier (COSV)
    LEGACY_MUTATION_ID      Legacy mutation identifier (COSN)
CosmicResistanceMutations.tsv
    MUTATION_ID             Alternative mutation identifier
    GENOMIC_MUTATION_ID     Genomic mutation identifier (COSV)
    LEGACY_MUTATION_ID      Legacy mutation identifier (COSM)
CosmicCLP_MutantExport.tsv
    GENOMIC_MUTATION_ID     Genomic mutation identifier (COSV)
    LEGACY_MUTATION_ID      Legacy mutation identifier (COSM)
    MUTATION_ID             Alternative mutation identifier
CosmicCLP_NCVExport.tsv
    GENOMIC_MUTATION_ID     Genomic mutation identifier (COSV)
    LEGACY_MUTATION_ID      Legacy mutation identifier (COSN)
CosmicFusionExport.tsv
    5'_CHR                  Chromosome of 5' gene
    5'_GENOME_START_FROM    The genomic coordinate of the start (+ strand)/breakpoint (- strand) of the 5' fusion gene as described in the Translocation Name
    5'_GENOME_START_TO      The range of genomic coordinates of the start (+ strand)/breakpoint (- strand) of the 5' fusion gene if it is an unknown base position
    5'_GENOME_STOP_FROM     The genomic coordinate of the breakpoint (+ strand)/start (- strand) of the 5' fusion gene as described in the Translocation Name
    5'_GENOME_STOP_TO       The range of genomic coordinates of the breakpoint (+ strand)/start (- strand) of the 5' fusion gene if it is an unknown base position
    5'_STRAND               The orientation of the 5' gene (+/-)
    3'_CHR                  Chromosome of 3' gene
    3'_GENOME_START_FROM    The genomic coordinate of the breakpoint (+ strand)/stop (- strand) of the 3' fusion gene as described in the Translocation Name
    3'_GENOME_START_TO      The range of genomic coordinates of the breakpoint (+ strand)/stop (- strand) of the 3' fusion gene if it is an unknown base position
    3'_GENOME_STOP_FROM     The genomic coordinate of the stop (+ strand)/breakpoint (- strand) of the 3' fusion gene as described in the Translocation Name
    3'_GENOME_STOP_TO       The range of genomic coordinates of the stop (+ strand)/breakpoint (- strand) of the 3' fusion gene if it is an unknown base position
    3'_STRAND               The orientation of the 3' gene (+/-)
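As an illustration of how the three identifier columns relate, the sketch below groups rows of a tab-separated export by GENOMIC_MUTATION_ID and collects the alternative mutation IDs seen under each parent. The three-row extract and all ID values in it are invented for the example, real export files contain many more columns, and the function name group_by_genomic_id is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical extract: the ID values below are invented for illustration.
SAMPLE_TSV = (
    "GENOMIC_MUTATION_ID\tLEGACY_MUTATION_ID\tMUTATION_ID\n"
    "COSV0000001\tCOSM0000001\t1001\n"
    "COSV0000001\tCOSM0000001\t1002\n"
    "COSV0000002\tCOSM0000002\t1003\n"
)


def group_by_genomic_id(handle) -> dict:
    """Map each COSV to the sorted alternative mutation IDs found under it."""
    groups = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        cosv = row["GENOMIC_MUTATION_ID"]
        if cosv:  # mutations without genomic coordinates have no COSV
            groups[cosv].add(row["MUTATION_ID"])
    return {cosv: sorted(ids) for cosv, ids in groups.items()}


groups = group_by_genomic_id(io.StringIO(SAMPLE_TSV))
print(groups)
```

A set is used per parent because the same mutation may appear on several rows of an export file.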

Additional columns in Variant Call Format (VCF) files

VCF files incorporate the new IDs in the following columns:

File
    Details

VCF/CosmicCodingMuts.vcf.gz
    ID column now holds the genomic mutation identifier (COSV)
    INFO column now has an additional key 'LEGACY_ID' (COSM)
VCF/CosmicNonCodingVariants.vcf.gz
    ID column now holds the genomic mutation identifier (COSV)
    INFO column now has an additional key 'LEGACY_ID' (COSN)
    INFO column also has an additional key 'CNT' to denote the sample count for each variant
VCF/CellLinesCodingMuts.vcf.gz
    ID column now holds the genomic mutation identifier (COSV)
    INFO column now has an additional key 'LEGACY_ID' (COSM)
VCF/CellLinesNonCodingVariants.vcf.gz
    ID column now holds the genomic mutation identifier (COSV)
    INFO column now has an additional key 'LEGACY_ID' (COSN)
    INFO column also has an additional key 'CNT' to denote the sample count for each variant
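A minimal sketch of reading these fields from a single VCF data line follows. It relies only on the standard eight fixed VCF columns and the LEGACY_ID and CNT keys named above. The example record, including its coordinates and identifiers, is invented, and parse_cosmic_vcf_line is a hypothetical helper rather than part of any COSMIC tooling.

```python
def parse_cosmic_vcf_line(line: str) -> dict:
    """Extract the COSV, legacy ID and sample count from one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    # Standard VCF column order: CHROM POS ID REF ALT QUAL FILTER INFO
    chrom, pos, variant_id, ref, alt, _qual, _filter, info = fields[:8]
    info_map = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_map[key] = value
    return {
        "chrom": chrom,
        "pos": int(pos),
        "genomic_id": variant_id,               # COSV, from the ID column
        "ref": ref,
        "alt": alt,
        "legacy_id": info_map.get("LEGACY_ID"),  # COSM or COSN
        "count": int(info_map["CNT"]) if "CNT" in info_map else None,
    }


# Hypothetical record: every value here is invented for illustration.
record = parse_cosmic_vcf_line(
    "1\t12345\tCOSV0000001\tC\tT\t.\t.\tLEGACY_ID=COSN0000001;CNT=2\n"
)
print(record["genomic_id"], record["legacy_id"], record["count"])
```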

Changes to the website

Inevitably, there have been changes to the website in order to accommodate the new parent-child model, but every attempt has been made to restrict the number and extent of the changes in order to minimise disruption.

In general, mutations remain searchable as previously, via the legacy identifiers (COSM or COSN). This will display the variant overview page, with the variant displayed on the canonical transcript (in the case of coding mutations). The new genomic identifier is displayed at the top of the page, along with the legacy mutation identifier (COSM). An 'unknown' genomic mutation identifier indicates that the genomic coordinates for this variant are unknown and therefore it does not have a genomic identifier. The new genomic mutation identifier (COSV) is not currently searchable on the website. The alternative mutation ID (which can be found in the overview section) may be used to view the variant on alternative transcripts.

Mapping between genome assemblies

The "Genome Version" menu switches the view between GRCh37 and GRCh38 assembly builds. When viewing a gene on the canonical transcript in GRCh37, switching to view the same gene on GRCh38 displays variants on the canonical transcript for GRCh38, irrespective of whether the canonical transcripts are the same between assemblies. For example, switching from viewing BRAF variants on the canonical transcript on GRCh37 (ENST00000288602.6) will display BRAF variants on the canonical transcript on GRCh38 (ENST00000646891.1).

When viewing the gene on an alternative transcript, switching between genome assemblies will only be successful if there is an equivalent transcript on the other assembly. If the equivalent transcript does not exist then the page will show "No data is available". The gene can, however, be found on the new assembly by searching using the gene name, which will display variants on the canonical transcript for that assembly. From here the variants can be viewed on the alternative transcripts as required.

Switching between assemblies operates in a similar way at the mutation level. When viewing a variant on the canonical transcript in GRCh37, switching to view the same variant on GRCh38 displays the variant on the equivalent transcript in the other assembly. In most cases this is also the canonical transcript, but not always, so the switch may occasionally reflect the change in the resulting CDS and AA syntaxes. If the equivalent transcript does not exist on the other assembly then the page will show "No data is available". The variant can, however, be found on the new assembly by searching using the COSM, which will display the variant on the canonical transcript for that assembly. From here the variant can be viewed on the alternative transcripts as required.

What is happening when something unexpected is seen?

There are several situations in which behaviours may have changed more substantially and data will not be presented on the website as they were previously. Some common examples are:

The mutation has been flagged as a SNP mutation

Example: COSM7088790

This is highlighted on the web page with a message:

"The mutation COSM7088790 has been flagged as a SNP. It is present in the mutation download files on the SFTP site but has been excluded from the website."

This will now occur on the Cell Lines Project pages too. All SNPs are still available in the complete mutation download files.

The mutation has been merged with another to remove duplication

Example: COSM5846084 and COSM5846086 have been merged into COSM5846085

Some mutations were previously represented by multiple COSM identifiers. Either they were true duplicates, or, for example, there could be various uncertain mutations each with individual and specific Mutation Remarks. Such redundancy has been eliminated by merging duplicate mutations, including merging all uncertain mutations for any given gene into a single legacy mutation identifier. Remarks for these merged mutations are not currently available but will be brought back in a future release.

The merged COSMs are still accessible via the search function and the website seamlessly redirects to the newly merged identifier. If the mutation has been merged the mutation overview page will display:

"The legacy mutation COSM5846084 has now been merged into the following mutation."

The mutation has unknown genomic coordinates

Mutations without known genomic coordinates will not have a new genomic mutation identifier (COSV). However, these mutations can still be viewed on both assemblies using the "Genome Version" switch on the website. As these mutations have been mapped using the transcript (and not the genomic location) it has not always been possible to lift over the mutation from one assembly build to the other, in which case the mutation will exist in only one assembly. For example, COSM7002422 can only be viewed on the GRCh37 assembly and in such cases the webpage will display a message saying:

"The mutation with ID 7002422 was not found in our database."

The mutation has no equivalent transcript on the other assembly

In some cases a transcript will exist only on one genome assembly. For example, variant COSM476 may be viewed on an alternative transcript such as ENST00000644969.1 on GRCh38. If the view is switched from the GRCh38 assembly to the GRCh37 assembly, since there is no equivalent transcript in GRCh37, the variant cannot be displayed and the page will show a message saying:

"The mutation with ID 149385292 was not found in our database."

The mutation can only be shown on alternative transcripts

In some cases mutations could only be annotated on the alternative transcript of a gene because there is no genomic overlap between the variant and the canonical transcript for that gene. A search for such a mutation will display a table of the annotations on the alternative transcripts on which the mutation may be viewed. This can be seen with COSM2988337.

The mutation has different CDS syntax presented on the GRCh37 and GRCh38 assembly builds

Due to differences between the Ensembl GRCh37 and GRCh38 gene builds, certain transcripts have changed significantly. This means that the variant annotation on GRCh38 may result in mutation syntaxes which are different to those on GRCh37.

An example of this is BRAF COSM476, which when viewed on the canonical transcript on GRCh37 (ENST00000288602.6) has the CDS syntax c.1799T>A, resulting in the AA mutation p.V600E.

When the view is switched to GRCh38, this mutation appears on the equivalent transcript (ENST00000288602.11), which now has the CDS syntax c.1919T>A and resulting AA mutation p.V640E.
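As a quick arithmetic check on this example, a 1-based CDS position maps to its codon by dividing by three and rounding up. The helper name codon_number is hypothetical; the positions are the ones quoted above.

```python
def codon_number(cds_position: int) -> int:
    """Return the 1-based codon index for a 1-based CDS nucleotide position."""
    return (cds_position + 2) // 3  # integer form of ceil(position / 3)


grch37_codon = codon_number(1799)   # c.1799T>A on ENST00000288602.6
grch38_codon = codon_number(1919)   # c.1919T>A on ENST00000288602.11
shift_nt = 1919 - 1799              # difference in CDS position
print(grch37_codon, grch38_codon, shift_nt, shift_nt // 3)
```

The 120-nucleotide difference in CDS position corresponds to exactly 40 codons, which matches the change from p.V600E to p.V640E.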

This situation can be further confused when the canonical transcript is different between the GRCh37 and GRCh38 assembly builds, as is the case with BRAF. This applies to the following 33 expert curated/classic genes:

The mutation can be viewed on multiple genes

In some cases, genes overlap at the genomic location of a variant. A mutation that was originally mapped to one gene will now be mapped to all overlapping genes and transcripts at that genomic location. A search for such a mutation (via the legacy mutation identifier) will display a table of all annotations, complete with information for canonical transcripts. This happens with COSM1002220, where you can select to view the variant on either the ZNF17 or AC004076.7 genes. The variant will have the same genomic mutation identifier and legacy mutation identifier but a unique alternative mutation ID depending on the gene/transcript that it is viewed on.

The mutation can be viewed as both coding and non-coding depending on its location on or near each transcript

Previously, all mutations were classified as coding (COSM) or non-coding (COSN). As all variants have been reannotated using VEP, some variants now fall within coding regions of some transcripts and/or non-coding regions of other transcripts of a gene. These variants will share the same legacy mutation identifier (COSM/COSN) and be assigned a new genomic mutation identifier (COSV).

For example:

  1. COSN7964089 is displayed on the website as a non-coding variant, but it can also be found in the list of variants for BRAF (on ENST00000288602) with the mutation syntax c.-47C>T, showing that it lies within the 5'UTR of that transcript.
  2. COSN10000039 was formerly classified as a non-coding variant in COSMIC, because it was annotated in the intron of a transcript. Since reannotation, this non-coding variant has been reclassified as a coding variant in the MAPK3 gene and has been assigned CDS syntax c.2542-277T>A.

The fusion mutation counts look different

The main fusion page contains a list of fusion pairs which have been curated in COSMIC (https://cancer.sanger.ac.uk/cosmic/fusion). The fourth column in the table now shows the number of curated papers. This number previously included listed papers (papers deemed unsuitable for curation) and review articles, and therefore for some fusion pairs this will be a lower but more accurate number.

The fusion translocation name is not displayed

The translocation names (syntaxes) for Observed RNAs/Related Breakpoints are not displayed on the website but are all visible in the full fusion download (Complete Fusion Export).

The genomic coordinates have changed slightly or don’t match other resources such as ClinVar

Example: compare the following two variants, which are in fact the same deletion event:

COSV53149564 (COSM5010329): 17:7675155..7675166, GRCh38
ClinVar ID 133282: Chr17: 7675153 - 7675164, GRCh38

In COSMIC, we follow the HGVS nomenclature of representing a variant in the most 3' position possible. Consider the following two scenarios for the sequence for chromosome 17:7675152-7675167 on GRCh38:

a. COSMIC representation, after HGVS 3' rule is applied (COSV53149564)
CGGGCGGGGGTGTGGA -> positions 7675155-7675166 deleted gives
CGG------------A
b. ClinVar representation (ClinVar ID 133282)
CGGGCGGGGGTGTGGA -> positions 7675153-7675164 deleted gives
C------------GGA

Both events result in the same end sequence.

Further details can be found on the HGVS website. HGVS rules have been introduced throughout for the v90 release, so comparing variants between previous COSMIC releases and v90 may show short sequence changes.
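The equivalence of the two representations above can be checked mechanically. The sketch below applies each deletion to the 16-base window quoted in the example, then slides the ClinVar coordinates towards the 3' end for as long as the resulting sequence is unchanged. The function names are hypothetical, and the shift only looks inside the quoted window, so it is a toy version of the 3' rule rather than a general normaliser.

```python
REGION_START = 7675152          # first base of the quoted window (chr17, GRCh38)
SEQUENCE = "CGGGCGGGGGTGTGGA"   # 17:7675152-7675167, as given in the example


def apply_deletion(seq: str, region_start: int, del_start: int, del_end: int) -> str:
    """Remove the inclusive genomic range del_start..del_end from seq."""
    i = del_start - region_start
    j = del_end - region_start + 1
    return seq[:i] + seq[j:]


def shift_3prime(seq: str, region_start: int, del_start: int, del_end: int):
    """Slide a deletion rightwards while the resulting sequence stays the same."""
    i = del_start - region_start
    j = del_end - region_start + 1
    # Moving one base right is neutral when the base leaving the deletion
    # on the left equals the base entering it on the right.
    while j < len(seq) and seq[i] == seq[j]:
        i += 1
        j += 1
    return i + region_start, j - 1 + region_start


cosmic = apply_deletion(SEQUENCE, REGION_START, 7675155, 7675166)
clinvar = apply_deletion(SEQUENCE, REGION_START, 7675153, 7675164)
shifted = shift_3prime(SEQUENCE, REGION_START, 7675153, 7675164)
print(cosmic, clinvar, shifted)
```

Both deletions leave the same four bases, and shifting the ClinVar coordinates to the most 3' position reproduces the COSMIC coordinates.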


Variant Updates (updated 14 August 2019)

The COSMIC database is undergoing an extensive update and reannotation, in order to ensure standardisation and modernisation across COSMIC data. This will substantially improve the identification of unique variants that may have been described at the genome, transcript and/or protein level. The introduction of a Genomic Identifier, along with complete annotation across multiple, high quality Ensembl transcripts and improved compliance with current HGVS syntax, will enable variant matching both within COSMIC and across other bioinformatic datasets.

As a result of these updates there will be significant changes in the upcoming releases as we work through this process. The first stage of this work was the introduction of improved HGVS syntax compliance in our May release and is detailed below. The majority of the changes will be reflected in COSMIC v90, which will be released in late August or early September, and the remaining changes will be introduced over the next few releases.

Changes in v90

The significant changes in v90 include:

Key points

COSMIC variants have been annotated on all relevant Ensembl transcripts across both the GRCh37 and GRCh38 assemblies from Ensembl release 93. New genomic identifiers (e.g. COSV56056643) are used, which refer to the variant change at the genomic level rather than gene, transcript or protein level and can thus be used universally. Existing COSM IDs will continue to be supported and will now be referred to as legacy identifiers e.g. COSM476. The legacy identifiers (COSM) are still searchable. In the case of mutations without genomic coordinates, hence without a COSV identifier, COSM identifiers will continue to be used.

All relevant Ensembl transcripts in COSMIC (which have been selected based on Ensembl canonical classification and on the quality of the dataset to include only GENCODE basic transcripts) will now have both accession and version numbers, so that the exact transcript is known, ensuring reproducibility. This also provides transparency and clarity as the data are updated.

How these changes will be reflected in the download files

As we are now mapping all variants on all relevant Ensembl transcripts, the number of rows in the majority of variant download files has increased significantly. In the download files, additional columns are provided including the legacy identifier (COSM) and the new genomic identifier (COSV). An internal mutation identifier is also provided to uniquely represent each mutation, on a specific transcript, on a given assembly build. The accession and version number for each transcript are included. File descriptions listing and explaining the new columns are now available for each of the download files. They will also be available from the downloads page after the v90 release.

Taster files

We have prepared files that contain a sample of the complete dataset for each of the COSMIC download files. This gives you real data in the final v90 file format to manipulate and integrate ahead of the release.

Dataset         GRCh37      GRCh38
COSMIC          tar, zip    tar, zip
Cell lines      tar, zip    tar, zip

File descriptions: file_descriptions.xlsx

HGVS syntax update (COSMIC release 88)

In this release (v88) we have updated the HGVS nomenclature for many of the manually curated mutations that were published without CDS/genomic information (c.? mutations). Details of how these updated syntaxes are reflected in the data are given below.

Use of X in place of N to indicate unknown amino acid

We are now using X to indicate an unknown amino acid instead of N as per HGVS standards. Many of the more recently curated mutations retain the NNN... notation. These will be updated to the XXX... notation in a future release.

Frameshift mutations

Most manually curated frameshift mutations with unknown CDS change (c.?) now include the first mutant amino acid in the syntax. For example:

c.? / p.C1396Lfs*5
c.? / p.V1833Afs*?
c.? / p.S1303Xfs*58
c.? / p.P463Xfs*?

Frameshift mutations with known genomic/CDS details have not yet been updated and therefore retain the original syntax, for example c.355_356insATGG / p.E121fs*5.

Unknown substitution and insertions

Most missense substitution mutations with no reported CDS change now have the syntax style p.H1904X.

Unknown mutations remain p.H1904?.

Most unknown insertion mutations now have the syntax p.F12_S13insXXX.

Whole gene deletions

Many manually curated whole gene deletions have been updated to the syntax c.1_*del / p.0. A handful remain in the old style c.1_3267del / p.0?.

fs*1 / nonsense mutations

Most manually curated mutations which had a fs*1 (for example p.S123fs*1) syntax have been updated to a substitution nonsense syntax (e.g. p.S123*) and AA mutation type, in keeping with HGVS recommendations. This also applies to a few with known CDS information. However, most with genomic information have not yet been updated and retain the old syntax style, e.g. c.368_369insT / p.N124fs*1.
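The fs*1 rewrite described above amounts to a simple string transformation on the protein-level syntax. The sketch below is illustrative only: the pattern and the function name fs1_to_nonsense are assumptions, and it says nothing about which records COSMIC actually updated.

```python
import re

# Matches old-style syntaxes such as p.S123fs*1; the anchored "1" keeps
# longer frameshifts (fs*5, fs*10, fs*?) untouched.
FS_STAR_1 = re.compile(r"^(p\.[A-Z]\d+)fs\*1$")


def fs1_to_nonsense(aa_syntax: str) -> str:
    """Rewrite p.<AA><pos>fs*1 as the nonsense substitution p.<AA><pos>*."""
    return FS_STAR_1.sub(r"\g<1>*", aa_syntax)


for syntax in ("p.S123fs*1", "p.N124fs*1", "p.E121fs*5", "p.C1396Lfs*5"):
    print(syntax, "->", fs1_to_nonsense(syntax))
```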