Data Portal

The Catalogue Of Somatic Mutations In Cancer (COSMIC) is a comprehensive database of somatic mutations. This dataset can be examined in several ways:
  1. COSMIC Website: http://cancer.sanger.ac.uk/cosmic
  2. Downloads: Cosmic Data Download
  3. COSMICMart: COSMICMart
  4. Oracle database export: Please contact the COSMIC team for database dump files.

COSMIC Overview

There are a number of important components to the COSMIC website. The 'Site Tour' tutorial below describes these components.

COSMIC Search

COSMIC's home page provides options to search the website through multiple entry points. Searching leads to web pages where the data set can be examined through various graphical and tabular views. The home page also provides information about the latest updates in COSMIC, current statistics, and links to additional CGP (Cancer Genome Project) resources.

Search

COSMIC can be searched in several ways. For example, by

  1. Gene name or HUGO synonym (eg BRAF or B-raf)
  2. Tissue or cancer type such as 'lung' or 'colon' (classified in COSMIC as 'large intestine')
  3. Mutation description eg the common KRAS mutation "c.35G>A" (CDS syntax) or "p.G12D" (Amino acid syntax)
  4. Combined gene and mutation description eg "KRAS p.G12D"
  5. Sample name such as 'COLO-829' or a Cosmic Sample Id eg '687448'

After searching, the results are listed by category with a table showing the number of hits and a panel underneath with a tab for each category. The listings in each category have links to the relevant overview pages in the COSMIC website.

Search By "Gene"

The gene search finds matching gene names or transcript names, even if only partially known. There are two options, "Exact Match" and "All Matches"; "Exact Match" is selected by default. If the gene name "TP53" is entered with the "Exact Match" option, the user is taken directly to the "TP53" gene overview page.

Alternatively, if the "All Matches" option is selected and "TP53" is searched, all TP53 matches are displayed on an intermediate page.

This intermediate page has 3 tabs:

  1. Census - gene names in the Cancer Gene Census (a list of known cancer genes).
  2. Mutations - gene names outside of the Cancer Gene Census, with mutations.
  3. No Mutations - gene names outside of the Cancer Gene Census, with no mutations.

Search by "Sample"

The sample search finds matching sample names. There are three options to restrict the search:

  1. All Samples - finds samples which are either tumour or cell line
  2. Tumour sample - finds tumour samples only
  3. Cell lines - finds cell lines only

For example, "PC-3" has multiple entries in COSMIC, some being tumours and others cell lines. To view only the cell lines with the name PC-3, select the 'Cell lines' option before searching.

Searching lists all the samples in COSMIC that share the sample name but have different IDs, each linking to the sample overview page for more details.

Note: for more details on samples, please follow this link.

Search By "Cancer"

Follow this link to use the Cancer Browser where a list of primary tissue types is available for selection to view tissue/disease specific mutation frequencies with links to genes, mutations and sample details.

Sample Counts in COSMIC

A sample is a cell line or a single piece of tumour in which one or more genes have been examined for mutations. These experiments can be carried out in a number of ways, but usually involve sequencing. The name of the sample is defined by the data source. Usually cell lines have recognised names (which we capture) such as 'HCC38' or 'PC-3'. Names of primary tumours are often more abstract, sometimes numeric ('1','2'...), and often completely absent, in which case they are assigned a 6- or 7-digit name reflecting their database ID. Multiple instances of the same sample name can exist as separate entries, indicating that it was unclear during curation whether these samples were identical, despite the shared name. This is especially acute for cell lines, where the same sample name can indicate very different biological material; for instance, the name 'PC-3' (http://cancer.sanger.ac.uk/cosmic/sample/overview?name=PC-3) is used for cell lines from 3 different tissues.

A number of tumours can be examined from a single cancer patient, and a number of samples can be examined from each of these tumours. Each sample has its own name and ID. Their identical ancestry is indicated how?

Sample counting

To account for the duplication of probably identical samples during curation, we attempt to combine samples with identical names and disease descriptions. For instance these two PC-3's will be counted as one (in mutation frequency calculations) since it's likely they're the same thing, just curated from different papers:

    
        Sample id     Name  Primary site(s)
        COSS1028650   PC-3  prostate  
        COSS1028702   PC-3  prostate  
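
The combining step can be sketched as follows. This is only an illustration, not the COSMIC implementation: the function name is hypothetical, the primary site stands in for the full disease description, and the case-insensitive matching is an assumption.

```python
from collections import defaultdict

def count_unique_samples(samples):
    """Count samples, merging entries that share a name and primary site.

    `samples` is a list of (sample_id, name, primary_site) tuples.
    """
    groups = defaultdict(list)
    for sample_id, name, primary_site in samples:
        # Assumption: matching ignores case and surrounding whitespace.
        key = (name.strip().upper(), primary_site.strip().lower())
        groups[key].append(sample_id)
    return len(groups), dict(groups)

samples = [
    ("COSS1028650", "PC-3", "prostate"),
    ("COSS1028702", "PC-3", "prostate"),
]
unique_count, merged = count_unique_samples(samples)
print(unique_count)  # 1: both entries are counted as one sample
```

A third 'PC-3' entry from a different tissue would form its own group and be counted separately.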
    

Mutation Frequency

The mutation frequency of a gene or tissue on the COSMIC webpages is a simple division of the number of samples with observed mutations by the number of samples examined, from our curations. This data comes from two different contexts: the published literature and the Cancer Genome Consortium data. The Cancer Genome Consortium data can be considered fully objective, since every gene has been fully sequenced through every sample. However, for the genes with full literature curation (http://cancer.sanger.ac.uk/cancergenome/projects/classic/), the % frequencies reflect the samples and mutations as they are published. Since it is more difficult to publish studies which find no mutations, it is likely these frequencies are less accurate, simply representing the best current knowledge.
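
As a minimal sketch of the division described above (the function name and the example counts are hypothetical, chosen only for illustration):

```python
def mutation_frequency(samples_mutated, samples_examined):
    """Percentage of examined samples in which a mutation was observed."""
    if samples_examined <= 0:
        raise ValueError("at least one sample must have been examined")
    if not 0 <= samples_mutated <= samples_examined:
        raise ValueError("mutated samples must be between 0 and samples examined")
    return 100.0 * samples_mutated / samples_examined

# Hypothetical counts: 25 mutated samples out of 200 examined.
print(mutation_frequency(25, 200))  # 12.5
```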

Data Types

Aberrant Gene Expression

Gene expression level 3 data has been downloaded from the publicly accessible TCGA portal.
The platform codes currently used to produce the COSMIC gene expression values are:

     IlluminaGA_RNASeqV2
     IlluminaHiSeq_RNASeqV2
     AgilentG4502A_07_2
     AgilentG4502A_07_3

For the Agilent platforms, the two samples (one from the target sample and the other 
from the reference sample) are labeled with Cy3 and Cy5 and mixed, and then hybridized to 
a single microarray. The relative intensities of Cy3 and Cy5 are then used in ratio-based
analysis to identify over expressed and under expressed genes 
[https://wiki.nci.nih.gov/pages/viewpage.action?pageId=72942598].

For the RNASeqV2 platforms, the files used are rsem.genes.normalized_results, which contain 
Level 3 expression data produced using MapSplice to do the alignment and RSEM to perform 
the quantitation [https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2].

Analysis
The mean and sample standard deviation of the gene expression values have been calculated 
from the tumour samples that are diploid for each corresponding gene, platform, and study. 
Based on these mean and STDEV values we have calculated the standard scores (z_score) for 
gene expression for each corresponding gene, platform, and study.

Qualitative merging of results
Results are merged qualitatively, per study (project_code), across analysis platforms. 
In order to display whether a gene is over or under expressed, a threshold of 2 STDEV, plus 
or minus, was selected. Where a sample has been analysed with more than one platform for a 
given study and gene, we display 'over' or 'under' only if the scores from all platforms are 
above or below the threshold, respectively. If they do not agree then we do not display it. 
The z_score column in the gene_regulation table is the z_score (serves as an indicative 
score) taken from the gene_expression table, from platforms in order of preference:

     IlluminaHiSeq_RNASeqV2
     IlluminaGA_RNASeqV2
     AgilentG4502A_07_3
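
The scoring and merging rules above can be sketched as follows. This is an illustration under stated assumptions, not the COSMIC pipeline: the function names and example scores are hypothetical, and the strict comparison against the threshold is an assumption, since the text does not say how values exactly at 2 STDEV are handled.

```python
from statistics import mean, stdev

THRESHOLD = 2.0  # 2 STDEV, plus or minus

# Order of preference for the indicative z_score, as listed in the text.
PLATFORM_PREFERENCE = [
    "IlluminaHiSeq_RNASeqV2",
    "IlluminaGA_RNASeqV2",
    "AgilentG4502A_07_3",
]

def z_score(value, reference_values):
    """Standard score against the diploid tumour samples for one gene,
    platform and study. stdev() is the sample standard deviation (n - 1)."""
    return (value - mean(reference_values)) / stdev(reference_values)

def call(z):
    # Assumption: strict comparison against the threshold.
    if z > THRESHOLD:
        return "over"
    if z < -THRESHOLD:
        return "under"
    return None

def merged_call(z_by_platform):
    """Return 'over' or 'under' only if every platform agrees, else None."""
    calls = {call(z) for z in z_by_platform.values()}
    return calls.pop() if len(calls) == 1 else None

def displayed_z(z_by_platform):
    """Pick the indicative z_score from the most preferred platform present."""
    for platform in PLATFORM_PREFERENCE:
        if platform in z_by_platform:
            return z_by_platform[platform]
    return None

# Hypothetical scores for one sample, gene and study.
scores = {"IlluminaHiSeq_RNASeqV2": 2.6, "AgilentG4502A_07_3": 3.1}
print(merged_call(scores), displayed_z(scores))  # over 2.6
```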

Copy Number Variants

For Cancer Genome Project data (including the Cell Lines Project) copy number analysis was carried 
out using the Affymetrix SNP6.0 array in conjunction with a bespoke algorithm PICNIC (Predicting Integral 
Copy Numbers In Cancer) [http://www.sanger.ac.uk/resources/software/picnic]

Where available, copy number data from TCGA and ICGC have been included in COSMIC (for samples already 
present in the database ie samples with mutations). All TCGA data included in COSMIC has been reanalysed 
using ASCAT [http://heim.ifi.uio.no/bioinf/Projects/ASCAT]

	    
Definition of Minor Allele and Copy Number in tables:

      Minor Allele: the number of copies of the least frequent allele 
                    eg if ABB, minor allele = A ( 1 copy) and major allele = B ( 2 copies)

      Copy Number:  the sum of the major and minor allele counts 
                    eg if ABB, copy number = 3
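
The two definitions can be checked with a short sketch. The genotype-string input (one letter per allele copy) and the function name are assumptions made for illustration; reporting a minor allele count of 0 when only one allele is present is also an assumption.

```python
from collections import Counter

def allele_counts(genotype):
    """Return minor allele count, major allele count and total copy number."""
    counts = Counter(genotype)
    if not counts or len(counts) > 2:
        raise ValueError("expected one or two distinct alleles, e.g. 'ABB'")
    ordered = counts.most_common()  # most frequent allele first
    major = ordered[0][1]
    # Assumption: with a single allele present, the minor allele count is 0.
    minor = ordered[1][1] if len(ordered) == 2 else 0
    return {"minor_allele": minor, "major_allele": major, "copy_number": major + minor}

print(allele_counts("ABB"))
```

For 'ABB' this gives a minor allele count of 1, a major allele count of 2 and a copy number of 3, matching the example above.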
      

Definition of Gain and Loss:
We have introduced filtering thresholds to only display CNVs which are high level amplifications,
homozygous deletions, or where there has been 'substantial loss' within an otherwise duplicated genome.
We also use a higher threshold for amplification if genome duplication has occurred. 
We use average ploidy > 2.7 to define genome duplication.

        
     1. ICGC samples

            Gain:  as defined in the original data

            Loss:  as defined in the original data

     2. TCGA samples reanalysed with ASCAT and CGP Cell Lines exomes analysed with PICNIC

            Gain:  average genome ploidy <= 2.7 AND total copy number >= 5
                   OR average genome ploidy > 2.7 AND total copy number >= 9

            Loss:  average genome ploidy <= 2.7 AND total copy number = 0
                   OR average genome ploidy > 2.7 AND total copy number < ( average genome ploidy - 2.7 )
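
The thresholds for TCGA and CGP Cell Lines samples can be written as a small function. This is a sketch of the rules as stated above; the function name is hypothetical, and returning None for segments that meet neither rule is an assumption about how "not displayed" is represented.

```python
GENOME_DUPLICATION_PLOIDY = 2.7

def classify_cnv(average_ploidy, total_copy_number):
    """Return 'gain', 'loss', or None if the segment would not be displayed."""
    if average_ploidy <= GENOME_DUPLICATION_PLOIDY:
        if total_copy_number >= 5:
            return "gain"
        if total_copy_number == 0:
            return "loss"
    else:
        # Genome duplication: higher amplification threshold,
        # and 'substantial loss' relative to the average ploidy.
        if total_copy_number >= 9:
            return "gain"
        if total_copy_number < average_ploidy - GENOME_DUPLICATION_PLOIDY:
            return "loss"
    return None

print(classify_cnv(2.0, 5))  # gain
print(classify_cnv(4.0, 1))  # loss, since 1 < 4.0 - 2.7
print(classify_cnv(3.5, 6))  # None
```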

Substitutions


Substitutions replace a single nucleotide with another and are annotated 
using syntax derived from HGVS nomenclature recommendations 
[http://www.hgvs.org/mutnomen/].

In COSMIC v70 (August 2014) we have applied filtering to the dataset. We have excluded 
data from any sample with over 15,000 mutations. In addition, we have flagged all 
known SNPs as defined by the 1000 genomes project, dbSNP and a panel of 378 normal (non-cancer) 
samples from Sanger CGP sequencing. Although all data are included in our download files, 
we have excluded flagged mutations from the website.

AA Mutation
The change that has occurred in the peptide sequence as a result of the mutation. 
Syntax is based on the recommendations made by the Human Genome Variation Society. 
The mutation type is shown in brackets after the mutation string. 
A description of each type can be found below in the section entitled Mutation Type.

CDS Mutation
The change that has occurred in the nucleotide sequence as a result of the mutation. 
Syntax is identical to the method used for the peptide sequence.

The mutation type is used to describe the type of mutation that has occurred.

Mutation Types:


    Nonsense :      A substitution mutation resulting in a termination codon, 
                    foreshortening the translated peptide.

    Missense :      A substitution mutation resulting in an alternate codon, 
                    altering the amino acid at this position only.

    Coding silent : A synonymous substitution mutation which encodes the same 
                    amino acid as the wild type codon.

    Intronic :      A substitution mutation outside the coding domains. No interpretation is 
                    made as to its effect on splice sites or nearby regulatory regions.

    Complex :       A compound mutation which may involve multiple insertions, deletions 
                    and substitutions.

    Unknown :       A mutation with no detailed information available.      
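
As a rough illustration of how the first three types relate to the AA Mutation string, the sketch below classifies simple substitutions such as "p.G12D". It assumes single-letter amino acid codes with '*' marking a termination codon; the function name and the inputs other than p.G12D are illustrative only, and anything else (intronic, complex or unparseable strings) is left unclassified.

```python
import re

# Assumption: single-letter amino acid codes, '*' for a termination codon.
AA_SUBSTITUTION = re.compile(r"^p\.([A-Z])(\d+)([A-Z*])$")

def substitution_type(aa_mutation):
    """Classify a simple amino acid substitution string, or return None."""
    match = AA_SUBSTITUTION.match(aa_mutation)
    if match is None:
        return None  # not a simple substitution; no interpretation is made
    wild_type, _position, mutant = match.groups()
    if mutant == "*":
        return "Nonsense"
    if mutant == wild_type:
        return "Coding silent"
    return "Missense"

for example in ("p.G12D", "p.R213*", "p.G12G"):
    print(example, substitution_type(example))
```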

Insertions/Deletions


Insertions and Deletions are annotated using syntax derived from HGVS nomenclature 
recommendations [http://www.hgvs.org/mutnomen/].

Insertion
An insertion of novel sequence into the gene.

    In frame :   An insertion of nucleotides which does not affect the gene's translation frame, 
                 leaving the downstream peptide sequence intact.

    Frameshift : An insertion of novel sequence which alters the translation frame, changing the 
                 downstream peptide sequence (often resulting in premature termination).

Deletion               
A deletion of a portion of the gene's sequence.

    In frame :    A deletion of nucleotides which does not affect the gene's translation frame, 
                  leaving the downstream peptide sequence intact.

    Frameshift :  A deletion of nucleotides which alters the translation frame, changing the 
                  downstream peptide sequence (often resulting in premature termination).
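
Whether a coding insertion or deletion is in frame depends on whether the number of nucleotides involved is a multiple of three. A minimal sketch (the function name is hypothetical):

```python
def frame_effect(n_nucleotides):
    """Classify an insertion or deletion in coding sequence by its length."""
    if n_nucleotides <= 0:
        raise ValueError("length must be a positive number of nucleotides")
    # A multiple of three adds or removes whole codons and keeps the frame.
    return "In frame" if n_nucleotides % 3 == 0 else "Frameshift"

print(frame_effect(3))  # In frame
print(frame_effect(4))  # Frameshift
```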

Structural Variants

The accurate description and annotation of structural variants can be complex. 
This is because variants are reported at different resolutions, from traditional 
cytogenetic coordinates down to the actual base pair positions. Furthermore, multiple 
rearrangements in a single area of the genome can make cataloguing and interpreting 
their effects challenging. 

The Rearrangement Overview page describes the one or more breakpoints which make up a structural 
variant. A breakpoint is defined as a region or point where the sample sequence has altered
from the reference sequence. Minimal interpretation is made of this data. One variant event
can consist of one or multiple breakpoints. The Syntax (shown above the table) gives a detailed 
description of the variant and its location (e.g. chr11:g.36585231_76606618del, a deletion of 
roughly 40Mb on chromosome 11). Syntax is based on HGVS mutation nomenclature recommendations
[http://www.hgvs.org/rec.html]. 

In the table of breakpoints, the columns are as follows:

    Mutation ID (COST)           Unique identifier for the variant

    Mutation Description         A short textual description of the variant 
                                 (e.g. tandem duplication, deletion, translocation)

    Order                        For a structural variant involving multiple breakpoints, 
                                 the predicted order along chromosome(s) is given (otherwise '0').

    Chromosome From              Gives the chromosome where the first variant/breakpoint occurs.

    Breakpoint From              Genomic coordinate of the start of the variant/breakpoint 
                                 (or range if base position not known). The icons next to the 
                                 coordinate are links to the COSMIC Genome Browser and Ensembl.

    Strand                       Orientation of the break relative to the reference sequence.

    Chromosome To                Gives the chromosome where the final variant/breakpoint occurs.

    Breakpoint To                Genomic coordinate of the last variant/breakpoint 
                                 (or range if base position not known). The icons next to the 
                                 coordinate are links to the COSMIC Genome Browser and Ensembl.

    Strand                       Orientation of the break relative to the reference sequence.

    Non Templated Inserted Seq   Sequence (if any) which is inserted at the breakpoint. The sequence
                                 is not encoded.


A controlled ontology of "Mutation Descriptions" is available below. 

Mutation Description Ontology
In order to help with the interpretation of structural variants in COSMIC, each variant is assigned 
a Mutation Description and Syntax. When the assignment takes place there is an interpretation
of the data and the currently known breakpoints in the region. If not all breakpoints have been 
characterised then the mutation may not be fully characterised. Below is a description of the Mutation 
Description Ontology with associated Syntax.

  1. Tandem Duplication
     A Tandem Duplication is characterised by a duplication of a segment of the genome which is 
     adjacent to the original sequence. The syntax takes the following format:

        chr2:g.124629221_125036287dup 

     where chr2: denotes the chromosome involved, g. genomic coordinates used, 124629221_125036287 
     the start and end of the variant, dup indicates tandem duplication.

     For a tandem duplication the breakpoint is characterised by upstream sequence mapping 
     downstream to where it should map on the genome. So in this case position 125036287 is mapping 
     before 124629221 which is the signature of a tandem duplication. 

  2. Deletion
   The syntax takes the following format:

       chr11:g.36585231_76606618del

   where chr11: denotes the chromosome involved, g. for genomic coordinates, 36585231 for the 
   deletion start point, 76606618 for the deletion end point, and del indicates a deletion event.

   For a deletion the breakpoint is characterised by 2 distant points in the genome being next to each 
   other. In this example position 36585230 is next to 76606619 in the genome. The region between these 
   points is assumed to be deleted. The coordinates of the deletion are +1 and -1 as the breakpoint gives 
   the last observed nucleotides, so the range of the deletion is from 36585231 to 76606618.

  3. Inversion
   An inversion indicates the reversal of a piece of genome sequence. The syntax takes the following 
   format:

       chr1:g.115340245_115346449inv 

   where chr1: denotes the chromosome involved, g. genomic coordinates used, 115340245_115346449 the 
   range of the inversion, and inv indicates an inversion.

   Two breakpoints can be detected for this mutation although only one is required to fully characterise 
   the mutation. 

  4. Translocation
   A Translocation is characterised by the fusion of 2 chromosomes. The syntax takes the following format:
 
      chr8:g.63669858_chr14:22298219trans[?] 

   Where chr8:g.63669858 denotes the breakpoint on one chromosome, and chr14:22298219 on the other chromosome, 
   trans indicates a translocation event, [?] indicates if there is any change in copy number associated 
   with the mutation. [?] indicates not known. 

   The strand information is often given in the syntax to describe which end of each chromosome actually 
   forms the translocation.

  5. Complex Substitution
   A Complex Substitution is defined as a region which has been deleted and replaced with another region of 
   the genome. The syntax takes the following format:

      chr8:g.55512043_63659930>chr13:22017510_22017585 

   where chr8: denotes the chromosome involved, g. for genomic coordinates, 55512043_63659930 indicates the 
   region deleted, > represents replaced with, and chr13:22017510_22017585 indicates the region inserted.

  6. Complex Amplicon
   A Complex Amplicon is a region of a genome which has been amplified and undergone multiple rearrangements. 
   Due to the complexity of these regions the amplicon breakpoints are listed but no interpretation is 
   made of the data.

   The syntax gives the range of the amplicon where the multiple rearrangements are occurring. 
   An example is 

      chr8:g.(61857345-?_129022677+?)[(10-40)] 

   where chr8: denotes the chromosome involved, g. for genomic coordinates, 61857345-?_129022677+? 
   indicates the range of the amplicon, with -? and +? indicating that the precise positions of the 
   start (-?) and end (+?) are not currently known, and [(10-40)] indicates the approximate copy number 
   of this region, between 10 and 40 copies in this case. 

  7. Amplicon Breakpoint(s)
   An amplicon breakpoint is defined as a breakpoint within an amplified region with unknown boundaries 
   so accurate interpretation of the mutation cannot be made. In these cases the breakpoint is simply 
   described. The syntax takes the following format:

      chr14:g.28412748_chr14:28419493bkpt[4] 

   where chr14: denotes the chromosome involved, g. for genomic coordinates, 28412748 is the end of the 
   sequence to the left of the breakpoint and 28419493 is the sequence coordinate to the right of the 
   breakpoint, bkpt indicates a breakpoint, and [4] the approximate copy number in the area.
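
The simpler syntaxes in this list (tandem duplication, deletion and inversion) share one shape and can be read with a short parser, and the +1/-1 rule for deletions can be applied when building the syntax from two breakpoints. The sketch below is illustrative only: the function names and the regular expression are assumptions, and the other syntaxes (translocation, complex substitution, amplicons) are deliberately not handled.

```python
import re

# Matches only the simple forms, e.g. chr2:g.124629221_125036287dup
SIMPLE_VARIANT = re.compile(
    r"^chr(?P<chrom>\w+):g\.(?P<start>\d+)_(?P<end>\d+)(?P<kind>dup|del|inv)$"
)
DESCRIPTIONS = {"dup": "tandem duplication", "del": "deletion", "inv": "inversion"}

def parse_simple_variant(syntax):
    """Split a dup/del/inv syntax string into its parts, or return None."""
    match = SIMPLE_VARIANT.match(syntax.strip())
    if match is None:
        return None
    return {
        "chromosome": match["chrom"],
        "start": int(match["start"]),
        "end": int(match["end"]),
        "description": DESCRIPTIONS[match["kind"]],
    }

def deletion_syntax(chromosome, left_breakpoint, right_breakpoint):
    """Build deletion syntax from the last observed nucleotides on each side.

    The deleted range starts one base after the left breakpoint and ends
    one base before the right breakpoint.
    """
    return f"chr{chromosome}:g.{left_breakpoint + 1}_{right_breakpoint - 1}del"

print(deletion_syntax("11", 36585230, 76606619))
print(parse_simple_variant("chr1:g.115340245_115346449inv"))
```

The first call reproduces the deletion example, chr11:g.36585231_76606618del, from the breakpoint positions 36585230 and 76606619.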

Sequence Fragment(s)
Structural variants can have additional sequence from elsewhere in the genome.
For example:

    chr8:g.64123513inschr12:7418993_7419327inschr12:8232312_8232333_chr12:7072996trans[(8-13)] 

is a translocation with 2 additional fragments from chromosome 12, one of 335 base pairs and the 
other of 22 base pairs.
   
Copy Number Information
Approximate Copy Number data is given when the variant is non-diploid and this information is available. 
The mutation description is prefixed with "amplified" or "amplicon" if there is variation in copy number. 
For example chr8:g.63669858_chr14:22298219trans[11-26] denotes a translocation with a copy number increase 
of approximately 11-26. A value of [2] would indicate diploid (normal).

Strand Information
In certain situations it is important to provide strand information to describe a variant. 
The HGVS "o" identifier is used to denote 'opposite strand'.

For example:

    chr1:g.58958334_chr12:o69893440bkpt

Non-Coding Variants

Non-coding variants are usually defined by whole genome screening and occur either in 
the intronic regions of genes or in intergenic regions of the genome. 

They are annotated using syntax derived from HGVS nomenclature recommendations 
[http://www.hgvs.org/mutnomen/].

The 'g.' format of the syntax denotes genomic coordinates, eg chr19:g.34210730C>T 
which is a C to T substitution at nucleotide 34,210,730 on chromosome 19.

Gene Fusions

Many papers determine fusions between genes (translocations) using expression technologies,
such as RT-PCR. 

A number of these studies have identified more than one transcript per sample, some finding 
over four different products between the same gene pair in one tumour. This implies significant 
alternative splicing of the mRNAs expressed from the fused gene pair. In order to simplify this 
data for display and navigation, we have inferred the position of the genomic breakpoint 
from the experimental data whilst maintaining the original results.

To do this, it has been assumed that each sample's breakpoint lies between the most 3' expressed 
exon of the 5' gene and the most 5' exon of the 3' gene, from the mRNAs reported in that sample. 

Inferred breakpoints are determined using the rule above, and the 'Observed mRNAs' are the expressed 
products actually reported as the result of experimental procedures. A single inferred breakpoint 
can allow the expression of a number of gene fusion mRNA variants, as above. However, additionally, 
a single observed mRNA can, between samples, be derived from a number of different breakpoints. 
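
The inference rule can be sketched as below. It assumes each observed mRNA is summarised by two exon numbers (the last exon retained from the 5' gene and the first exon retained from the 3' gene) and that exons are numbered 5' to 3' along each transcript; the function name and the example exon numbers are hypothetical.

```python
def infer_breakpoint(observed_mrnas):
    """Infer one breakpoint per sample from its observed fusion mRNAs.

    `observed_mrnas` is a list of (last_exon_of_5prime_gene,
    first_exon_of_3prime_gene) pairs reported for a single sample.
    Returns the exons flanking the inferred breakpoint: it lies after the
    first value (in the 5' gene) and before the second (in the 3' gene).
    """
    if not observed_mrnas:
        raise ValueError("at least one observed mRNA is required")
    most_3prime_exon_of_5prime_gene = max(five for five, _ in observed_mrnas)
    most_5prime_exon_of_3prime_gene = min(three for _, three in observed_mrnas)
    return most_3prime_exon_of_5prime_gene, most_5prime_exon_of_3prime_gene

# Hypothetical sample with three splice variants of the same fusion.
print(infer_breakpoint([(1, 4), (2, 4), (1, 5)]))  # (2, 4)
```

All three observed mRNAs in this example are compatible with a single breakpoint after exon 2 of the 5' gene and before exon 4 of the 3' gene.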

Syntax
The syntax describes the portions of mRNA present (in HGVS "r." format) from each gene, 
which allows representation of UTR sequences. This is a one-line syntax: 


Gene name 1                  HUGO
{                            new symbol associating mRNA sequence with gene name
Accession number 1           Genbank
}                            new symbol associating mRNA sequence with gene name
:                            separates gene identifier from coordinates
r.                           syntax defining mRNA portion present of first gene
_                            denotes a join of sequences
Gene name 2                  HUGO
{                            new symbol associating mRNA sequence with gene name
Accession number 2           Genbank
}                            new symbol associating mRNA sequence with gene name
:                            separates gene identifier from coordinates
r.                           syntax defining mRNA portion present of second gene


Here are 2 examples:

    1. Standard Fusion

    TMPRSS2 from exon 1 (UTR) to ERG exon 2 (inclusive).

    TMPRSS2{NM_005656.2}:r.1_71_ERG{NM_004449.3}:r.38_3097


    TMPRSS2 from intron after exon 1 to intron before ERG exon 2, intronic 
    breakpoints known (374bp downstream of TMPRSS2 exon 1 and 54bp 
    upstream of ERG exon 2).

    TMPRSS2{NM_005656.2}:r.1_71+374_ERG{NM_004449.3}:r.38-54_3097


    TMPRSS2 from intron after exon 5 to intron before ERG exon 3, 
    intronic breakpoints NOT known (but remarked on in the paper).

    TMPRSS2{NM_005656.2}:r.1_71+?_ERG{NM_004449.3}:r.38-?_3097

    2. Fusion to the complementary strand (flipped fusion)

    TMPRSS2 present in sense orientation, ERG in the antisense.

    TMPRSS2{NM_005656.2}:r.1_71_oERG{NM_004449.3}:r.38_3097


    Again, if the intronic co-ordinates are known.

    TMPRSS2{NM_005656.2}:r.1_71+374_oERG{NM_004449.3}:r.38-54_3097
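
The one-line syntax can be assembled from its parts as in the sketch below. The helper names are hypothetical and the coordinates are passed as strings so that intronic offsets such as "71+374" or unknown positions such as "71+?" can be written as they appear in the syntax.

```python
def fusion_part(gene, accession, start, end, opposite_strand=False):
    """Format one gene's portion, e.g. TMPRSS2{NM_005656.2}:r.1_71."""
    prefix = "o" if opposite_strand else ""  # 'o' marks the opposite strand
    return f"{prefix}{gene}{{{accession}}}:r.{start}_{end}"

def fusion_syntax(five_prime_part, three_prime_part):
    """Join the two portions with '_' to give the one-line fusion syntax."""
    return f"{five_prime_part}_{three_prime_part}"

standard = fusion_syntax(
    fusion_part("TMPRSS2", "NM_005656.2", "1", "71"),
    fusion_part("ERG", "NM_004449.3", "38", "3097"),
)
flipped = fusion_syntax(
    fusion_part("TMPRSS2", "NM_005656.2", "1", "71"),
    fusion_part("ERG", "NM_004449.3", "38", "3097", opposite_strand=True),
)
print(standard)
print(flipped)
```

These two calls reproduce the standard fusion and the flipped fusion examples without intronic coordinates.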