What is COSMIC?

All cancers arise as a result of the acquisition of a series of fixed DNA sequence abnormalities, mutations, many of which ultimately confer a growth advantage upon the cells in which they have occurred. There is a vast amount of information available in the published scientific literature about these changes. COSMIC is designed to store and display somatic mutation information and related details and contains information relating to human cancers.

Some key features of COSMIC are:

  • Contains information on publications, samples and mutations. Includes samples which were found to be negative for mutations during screening, which enables frequency data to be calculated for mutations in different genes in different cancer types.
  • Samples entered include benign neoplasms and other benign proliferations, in situ and invasive tumours, recurrences, metastases and cancer cell lines.

The mutation data and associated information are extracted from the primary literature and entered into the COSMIC database. We also upload data from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) portals. To provide a consistent view of the data, a histology and tissue ontology has been created (click the 'Classification' tab on this page for more details). We attempt to map every mutation to a single version of a gene, but where this is not possible we map to an alternative transcript. The data can be queried by tissue, histology or gene and displayed as a graph, as a table or exported in various formats.

How does COSMIC work?

Gene selection

We have assembled a list of genes that are somatically mutated and causally implicated in human cancer (Futreal et al, 2004). We call this list the Cancer Gene Census, and it is updated periodically with new genes. From this list we are selecting genes for entry into COSMIC, with an emphasis on genes for which there are no existing databases.

Gene sequences

We attempt to map every mutation to a single version of a gene, but where this is not possible we map to an alternative transcript. The gene sequences are held in COSMIC and available in the Download section here.

Selecting papers from the literature

To identify papers reporting somatic mutations, PubMed is broadly searched for papers containing relevant mutation data (example search: (ras OR genes, ras) AND human AND mutation). Papers whose abstracts indicate that they include somatic mutation information relating to cancer or pre-cancerous conditions are then selected for curation. After examination of the information in the full text of the paper, the sample and mutation data are extracted. Any papers containing incomplete data (e.g. mutations that are reported but not fully described) or data of insufficient quality (e.g. errors identified in the data) are not fully curated but are added to a list of "additional references containing somatic mutation information".

Mutation frequency

A central aim of COSMIC is to provide somatic mutation frequencies. These are available in the main display windows. However, it is important to understand how they are calculated and possible limitations of the data.
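
As a rough illustration of why mutation-negative samples matter, the sketch below treats a frequency as the number of mutated samples divided by the number of samples screened. The function name and the counts are hypothetical; they are not taken from COSMIC, and the real calculation may differ in detail.

```python
def mutation_frequency(mutated_samples: int, screened_samples: int) -> float:
    """Fraction of screened samples that carry a mutation in a given gene."""
    if screened_samples <= 0:
        raise ValueError("at least one screened sample is required")
    if not 0 <= mutated_samples <= screened_samples:
        raise ValueError("mutated samples must lie between 0 and the number screened")
    return mutated_samples / screened_samples


# Hypothetical counts for one gene in one tissue: 12 mutated out of 80 screened.
frequency = mutation_frequency(12, 80)
print(f"{frequency:.1%}")  # prints 15.0%
```

Without the 68 negative samples in the denominator, only a count of mutations could be reported, not a frequency.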

Has the sample been screened before?

There are examples where the same data is reported twice, perhaps in a follow-up study with reference to further data or as a positive control, for example using cell lines with known mutations. Where possible we have noted sample names and, within papers, have removed any redundancy. Between papers, however, it is not possible to confirm that two samples with the same name are indeed the same sample. We have therefore included both samples and both results in COSMIC. If you want to review this information, the sample name, mutation and paper reference are displayed in the Mutation Details view.

What mutation detection method was employed?

Mutation screening methods differ in their sensitivity and the sensitivity of a particular method can vary from laboratory to laboratory. Most methods identify all classes of small intragenic mutation (base substitutions and small insertions/deletions). However, the protein truncation test will not detect mutations that cause missense amino acid substitutions.

Was the whole gene screened?

Some genes are characterised by mutation hot spots, for example BRAF, RAS and TP53. These genes are often screened for somatic mutations only in the region most likely to contain mutations. This strategy will obviously miss mutations located elsewhere in the gene and hence will provide a distorted view of the distribution of mutations in the gene and perhaps underestimate the frequency of mutations.

Are all the mutations real?

For many putative somatic mutations that have been reported in the published literature, definitive evidence that they are somatically acquired (through demonstration of their absence in normal DNA from the same individual as the tumour) is not available. Therefore, occasional germline variants may have inadvertently been represented in publications as somatic mutations and entered in the database. In addition, simple laboratory errors which result in an incorrect normal DNA sample (i.e. from a different individual) being analysed as a control for a particular tumour sample may provide apparently persuasive, but misleading, evidence of somatic origin. Finally, DNA amplification methods have an intrinsic error rate, and these errors may subsequently be interpreted as somatic mutations. There is some evidence that this may be a particular problem in analyses of archival formalin-fixed, paraffin-embedded material.

Classification system

The classification of tumour types and subtypes with somatic mutations in the published literature is extremely variable. Classification systems and terminologies differ between reports and indeed may have changed over time. Rather than simply entering a neoplasm using the term employed in the published report, COSMIC uses its own internal classification system to provide tissue and histology consistency within the database and reduce redundancy. The tissue and histology information in the reviewed papers is translated using the COSMIC classification system before entry into the database. It is possible that in some instances we have misunderstood terminology and hence misclassified mutations. Moreover, some users may not favour our classification. In general, however, we have aimed to retain as much useful information as possible, whilst providing a relatively simple classification with generally understood terminology.

The COSMIC classification system is available as a tab delimited text or Excel file in the Download section below. Every sample is defined by both tissue and histology. The example below shows how a paper definition would be translated into a COSMIC definition.

    Type                  Paper Definition   COSMIC Definition
    Site primary          colon              large intestine
    Site subtype 1        descending         colon
    Site subtype 2        NS                 descending
    Histology             carcinoma          carcinoma
    Histology subtype 1   polypoid type      adenocarcinoma
    Histology subtype 2   with adenoma       NS
    Histology subtype 3   NS                 NS
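
The translation in the table above can be pictured as a lookup from the terms used in a paper to COSMIC terms. The following is a minimal sketch under that assumption: the `Classification` type, the `TRANSLATIONS` table and the `translate` function are hypothetical names, and the single entry simply reproduces the worked example. It is not how the COSMIC curators actually perform the mapping.

```python
from typing import NamedTuple


class Classification(NamedTuple):
    site_primary: str
    site_subtype_1: str
    site_subtype_2: str
    histology: str
    histology_subtype_1: str
    histology_subtype_2: str
    histology_subtype_3: str


# Hypothetical lookup table; "NS" is used exactly as in the table above.
TRANSLATIONS = {
    Classification("colon", "descending", "NS",
                   "carcinoma", "polypoid type", "with adenoma", "NS"):
    Classification("large intestine", "colon", "descending",
                   "carcinoma", "adenocarcinoma", "NS", "NS"),
}


def translate(paper_definition: Classification) -> Classification:
    """Return the COSMIC definition recorded for a paper definition."""
    try:
        return TRANSLATIONS[paper_definition]
    except KeyError:
        raise KeyError(f"no translation recorded for {paper_definition}") from None


paper = Classification("colon", "descending", "NS",
                       "carcinoma", "polypoid type", "with adenoma", "NS")
print(translate(paper).site_primary)  # prints large intestine
```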
    

The COSMIC classification system was created in close collaboration with Adrienne Flanagan and Ahmet Dogan from the Royal Free and University College Medical School.

Downloads

Classification Information

Comma Separated Values File

Author Guidelines

We list here some guidelines for authors to help improve the speed of curation for an increasing volume of literature relevant to COSMIC and to ensure the continuing accuracy of our curation. By following these guidelines authors will contribute to the quick and efficient dissemination of their research results via COSMIC.

Samples

  • Data is curated in COSMIC on a per sample basis so mutation or clinical data can only be entered in detail if it has been provided by the author on this basis. The minimum requirement for a sample to be included is that it has full mutation details provided at either nucleotide or protein level (see example below).
  • If some samples in a paper have already been screened for some or all of the reported genes in an earlier publication we would exclude these from the curation of the new paper in order to avoid duplication. To do this it is helpful if the duplicate samples have been highlighted in some way by the author.
  • For papers reporting samples with more than one mutation, from one or more genes, we need to know which specific mutations occur together in any given sample.

Reference sequences

  • It is much easier and quicker for us to map reported mutations to COSMIC reference sequences if the author has stated which reference sequence and version was used to describe their reported mutations (e.g. NM_006015.3 or ENST00000215919). For some genes this information is essential for mutation curation.

Mutations

  • COSMIC mutation syntax is based on the Human Genome Variation Society recommendations so it is useful if authors also use this nomenclature.
  • Ideally mutations would be described both at the nucleotide and amino acid levels. This is not so important for well characterised mutations (e.g. BRAF c.1799T>A, p.V600E) but is important for novel mutations so that we can confirm the mutation position on our reference sequence.
  • For insertion mutations it is very helpful if they are described as e.g. c.1118_1119insA rather than c.1118insA, which can be ambiguous, and if the protein result e.g. p.N373fs*6 is also provided so the position can be confirmed.
  • For frameshift mutations it is helpful for curation if they can at least be identified as either insertions or deletions, even if no nucleotide details can be provided.
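
Authors who want to check their own mutation descriptions before submission could use something like the sketch below. It is an illustration only: the three regular expressions are simplified assumptions covering just the forms quoted in these guidelines, not the full Human Genome Variation Society syntax, and `classify_cds_description` is a hypothetical helper rather than a COSMIC tool.

```python
import re

# Simplified, assumed patterns; real HGVS syntax covers many more cases.
SUBSTITUTION = re.compile(r"^c\.(\d+)(?:_(\d+))?([ACGT]+)>([ACGT]+)$")
FLANKED_INSERTION = re.compile(r"^c\.(\d+)_(\d+)ins([ACGT]+)$")
SINGLE_POSITION_INSERTION = re.compile(r"^c\.(\d+)ins([ACGT]+)$")


def classify_cds_description(description: str) -> str:
    """Label a coding-sequence mutation description in a very coarse way."""
    if SUBSTITUTION.match(description):
        return "substitution"
    flanked = FLANKED_INSERTION.match(description)
    if flanked:
        start, end = int(flanked.group(1)), int(flanked.group(2))
        # An insertion sits between two adjacent nucleotides.
        return "insertion" if end == start + 1 else "unrecognised"
    if SINGLE_POSITION_INSERTION.match(description):
        # Only one position given: unclear on which side the base was inserted.
        return "ambiguous insertion"
    return "unrecognised"


for example in ("c.1799T>A", "c.38_39GC>AT", "c.1118_1119insA", "c.1118insA"):
    print(example, "->", classify_cds_description(example))
```

Anything reported as "ambiguous insertion" or "unrecognised" would be a candidate for rewording or for adding the protein-level result.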

Author guidelines - suggested presentation of results

Sample (or Patient) ID* Patient age*  Patient gender*   Primary tissue  Primary subsite Tumour source*      Primary histology Subhistology  Stage*  CDS mutation  AA mutation
1                       45            F                 colon           left            primary             adenoma           villous               c.34G>T       p.G12C
2                       67            F                 rectum                          primary             adenoma           tubular               c.183A>T      p.Q61H
3                       51            M                 colon           sigmoid         metastasis (rectum) carcinoma                       IIA     c.38_39GC>AT  p.G13D

Reference sequence: KRAS NM_004985.3

*These additional clinical details can be added if data are available.

Additional columns could be added for further information e.g. smoking status, drug response, etc.
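
Before submitting a results table, it may help to confirm that every sample meets the minimum requirement of a mutation described at nucleotide or protein level. The sketch below assumes a tab-delimited table with the column names used above; `samples_missing_mutation` is a hypothetical helper, and the third row has deliberately been left without mutation details to show what gets flagged.

```python
import csv
import io

# Two rows follow the example table above; the third is a made-up incomplete row.
TABLE = (
    "Sample ID\tPrimary tissue\tCDS mutation\tAA mutation\n"
    "1\tcolon\tc.34G>T\tp.G12C\n"
    "2\trectum\tc.183A>T\tp.Q61H\n"
    "3\tcolon\t\t\n"
)


def samples_missing_mutation(tab_delimited_text: str) -> list[str]:
    """Return IDs of samples with neither a CDS nor an AA mutation description."""
    reader = csv.DictReader(io.StringIO(tab_delimited_text), delimiter="\t")
    missing = []
    for row in reader:
        cds = (row.get("CDS mutation") or "").strip()
        aa = (row.get("AA mutation") or "").strip()
        if not cds and not aa:
            missing.append(row["Sample ID"])
    return missing


print(samples_missing_mutation(TABLE))  # prints ['3']
```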

We hope you find these guidelines useful and if you have further questions please contact us at cosmic@sanger.ac.uk.

Regards,

The COSMIC literature curators