Frequently Asked Questions (FAQ)
Where does the data in COSMIC come from?
There are two types of data in COSMIC: expert manual curation data and systematic screen data. It is useful to understand the differences between these data types and to use them appropriately.
Expert curation data
- Manually input from peer reviewed publications by COSMIC expert curators
- Consists of comprehensive literature curation of selected Cancer Gene Census genes, introduced at a given release then updated at subsequent releases
- Includes additional data points relevant to each disease and publication
- Provides accurate frequency data as mutation negative samples are specified
- Also called non-systematic or targeted screen data
Genome-wide screen data
- Uploaded from publications reporting large scale genome screening data or imported from other databases such as TCGA and ICGC
- Provides unbiased molecular profiling of diseases while covering the whole genome
- Provides objective frequency data by interpreting non-mutant genes across each genome
- Facilitates finding novel driver genes in cancer
Can I download the full COSMIC dataset?
Yes. Files containing all data for each variant type (simple mutations, gene fusions, non-coding, structural,
copy number, expression and methylation variants) are available for download and these are updated for each COSMIC release.
Instructions on how to download, and details of all the files available, can be found on the Download page.
CosmicMutantExport.tsv.gz contains all the samples analysed for every gene in COSMIC found with mutations.
For targeted gene screens there is also a file CosmicCompleteTargetedScreensMutantExport.tsv.gz containing both positive
and negative data.
Note that you will need to register and login before you can download data.
Where can I download all the mutations from COSMIC?
The file can be found on the download page. It is called CosmicMutantExport.tsv.gz. This file contains all the samples analysed for every gene in COSMIC found with/without mutations.
How can I find the latest COSMIC release version?
The COSMIC home page shows the version number and release date for the current, most recent COSMIC release. You can find more information about the release in the news and release notes sections, which can be found under the News menu on every page.
Which version of the human reference sequence does COSMIC use?
The default version is GRCh38/hg38 but this can be changed to GRCh37/hg19 from the 'Genome Version' menu on the main navigation bar at the top of each page. The version currently selected in the menu will be ticked. When GRCh37 is selected an icon is displayed in the page header which indicates that the 'GRCh37 Archive' is being used.
How are samples counted in COSMIC?
Each sample has its own name and ID. Multiple instances of the same sample name can exist as separate entries, indicating that it was unclear during curation that these samples were identical, apart from their name. To account for the duplication of probably identical samples during curation, we attempt to combine samples that have identical names and disease descriptions. Please see the help documentation for more details.
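The combining step described above can be sketched as a simple grouping. This is only an illustration, not COSMIC's actual procedure: the record layout and the field names (sample_id, sample_name, site, histology) are hypothetical, and the "disease description" is assumed here to mean site plus histology.

```python
from collections import defaultdict

# Hypothetical sample records; field names and values are illustrative only.
samples = [
    {"sample_id": 1, "sample_name": "S1", "site": "large_intestine", "histology": "carcinoma"},
    {"sample_id": 2, "sample_name": "S1", "site": "large_intestine", "histology": "carcinoma"},
    {"sample_id": 3, "sample_name": "S1", "site": "lung", "histology": "carcinoma"},
    {"sample_id": 4, "sample_name": "S2", "site": "lung", "histology": "carcinoma"},
]

def combine_samples(records):
    """Group sample IDs that share a name and a disease description."""
    groups = defaultdict(list)
    for rec in records:
        # Entries are treated as one sample only if all three fields match.
        key = (rec["sample_name"], rec["site"], rec["histology"])
        groups[key].append(rec["sample_id"])
    return dict(groups)

combined = combine_samples(samples)
print(len(samples), "entries counted as", len(combined), "samples")
```

Entries 1 and 2 collapse into one sample, while entry 3 stays separate because its disease description differs even though the name is the same.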
What does 'NS' mean in tissue/histology classifications?
NS means 'not specified'.
Can I download data from older versions of COSMIC?
Yes. We make data files available to download for the
current release and for at least the three previous releases;
data files released more than one year ago will not be available. Files for the current release are
available from the Download page, but previous versions can be downloaded
programmatically or using a command line interface.
For instructions please see the 'Useful Links' section at the top right of the Download page or use the following links:
Downloading using command line tools
Available download files
I am preparing a manuscript for publication and I am including some COSMIC data. How should I cite COSMIC?
We are very happy for you to use the data, and any tabulations or graphic screenshots which support your work. Please cite the website address (cancer.sanger.ac.uk) and the paper 'COSMIC: the Catalogue of Somatic Mutations in Cancer'. Thank you.
How often is COSMIC updated?
What is the difference between cell line data in COSMIC and the Cell Line Project?
The Cell Line Project is an in-depth analysis of over a thousand commonly used cancer cell lines, as defined here. However, many more cell lines have been examined in the literature for somatic mutations, and they are all recorded in the standard COSMIC database.
What are Census genes and where is the updated version of the census?
The Cancer Gene Census is a list of genes known to be involved in cancer. They are listed here.
Can I submit data to COSMIC?
All mutation data in COSMIC is currently entered by our curators. If you would like to submit data for one of your publications, or even pre-publication, please contact email@example.com and one of our curators will be happy to help.
How are papers selected from the literature?
To identify papers reporting somatic mutations, PubMed is broadly
searched for papers containing relevant mutation data (example search:
(ras OR genes, ras) AND human AND mutation). Those papers
that are identified from their abstracts to include somatic mutation
information relating to cancer or pre-cancerous conditions are then
selected for curating. After examination of the information in the full
text of the paper, the sample and mutation data are extracted. Any papers
containing incomplete data (e.g. mutations that are reported but not
fully described) or data of insufficient quality (e.g. errors identified
in the data) are not fully curated but are added to a list of "additional
references containing somatic mutation information".
How do I calculate gene mutation frequencies across multiple transcript variants?
Currently COSMIC uses a 'one-to-one' model for genes and transcripts so mutation frequency counts on the website are calculated on a per-transcript basis, rather than across gene loci. As each unique genomic mutation can be annotated to multiple different splice variants of the same gene, adding the prevalence counts of the splice variants together can potentially add the same genomic mutation more than once. However, from v90 (Sept. 2019) we began uniquely identifying variants at the genome level using COSV identifiers. This means that with some basic bioinformatics/data analysis skills it is now possible to calculate frequencies by downloading and analysing complete datasets. A unique list of COSVs for a locus can be extracted from the CosmicCodingMuts.vcf file and the number of unique mutated samples can be counted using the COSV list and the CosmicMutantExport.tsv.gz file. Please also refer to the question 'How do I calculate mutant frequencies' on this page.
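As a rough sketch of the counting step described above, the snippet below counts distinct samples that carry any COSV from a locus, so a genomic mutation annotated on several transcripts is counted once per sample. The column names ID_sample and GENOMIC_MUTATION_ID, the gene names and the COSV values are assumptions made for illustration; check the header of the file you download. The COSV list is assumed to have been extracted from CosmicCodingMuts.vcf beforehand.

```python
import csv
import io

# Tiny stand-in for CosmicMutantExport.tsv.gz (hypothetical rows and columns).
EXPORT_TSV = (
    "Gene name\tID_sample\tGENOMIC_MUTATION_ID\n"
    "GENE1\t101\tCOSV0001\n"
    "GENE1_ENST000001\t101\tCOSV0001\n"  # same mutation on another transcript
    "GENE1\t102\tCOSV0002\n"
    "GENE2\t103\tCOSV0009\n"
)

def count_mutated_samples(export_handle, locus_cosvs):
    """Count unique samples with at least one COSV from the given set."""
    reader = csv.DictReader(export_handle, delimiter="\t")
    mutated = {
        row["ID_sample"]
        for row in reader
        if row["GENOMIC_MUTATION_ID"] in locus_cosvs
    }
    return len(mutated)

# COSVs for the locus, assumed to come from the VCF file.
locus_cosvs = {"COSV0001", "COSV0002"}
n_mutated = count_mutated_samples(io.StringIO(EXPORT_TSV), locus_cosvs)
print(n_mutated)
```

With a real download, the same function could be given a handle from gzip.open(path, "rt") instead of the in-memory string. Sample 101 appears on two transcript rows here but contributes only once to the count.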
What is the difference between census and classic genes?
A census gene is one that is known to be involved in cancer. The list of these genes is used to prioritise the literature curation for the COSMIC database. Once the literature for a census gene has been completely curated, it is released and sometimes termed a 'COSMIC classic' gene.
There are no mutations in the full text of the paper. Are these extracted from the supplementary material?
Yes. We utilize supplementary material for curation when it contains additional information.
What are the rules for mutation syntax in COSMIC?
Mutations are annotated using syntax derived from HGVS nomenclature recommendations (see http://varnomen.hgvs.org).
How do I examine a histology or cancer type?
COSMIC may use an alternative histology terminology, for example small cell carcinoma instead of neuroendocrine carcinoma (or, for some sites, neuroendocrine carcinoma instead of small cell carcinoma). More information about our classification system can be found at the URL below and all COSMIC tumour site and histology translations are available to view as an Excel spreadsheet or tab-delimited text file in the Classification documents found here. Note: you may also want to use our search to find the matching disease classification for alternative terminologies.
How do you define Mutation somatic status?
Confirmed somatic variant:
The variant allele from the tumour sample differs from the germ-line alleles of the same individual who provided the tumour sample.
Reported in another cancer sample as somatic:
There is no germ-line allele information provided for the tumour sample for the same individual, but the same variant has been found to be a 'Confirmed Somatic' variant in a normal-tumour sample pair from another patient.
Please note that the same variant from multiple samples from the same patient should always get the same somatic status, because all the samples share the same germ-line alleles in individuals who are not genetic mosaics.
Variant of unknown origin:
There is no information provided on the germ-line alleles in the data source to help determine if it is either a germ-line or somatic variant.
How do I examine colon cancer?
COSMIC uses the term Large Intestine for this cancer site. You can
find information about our classification system and download all COSMIC
tumour site and histology translations as an Excel spreadsheet or
tab-delimited text file in the
Classification documentation page.
How do I examine a tumour site?
COSMIC may use an alternative site, e.g. Large Intestine rather than Colon. More information about our classification system can be found at the URL below and all COSMIC tumour site and histology translations are available to view as an Excel spreadsheet or tab-delimited text file in the Classification documents found here.
Why is my search bringing back fewer records than expected?
Check that no filter (displayed in the sidebar) is unexpectedly active and limiting the gene region, tumour type, site etc. Some genes have been curated as part of systematic screens, so they have some data in COSMIC, but they have not yet been manually curated and therefore have less data than has been published. This may be because they are in the list of genes waiting to be manually curated, or because they are not included in the Cancer Gene Census.
Where can I find patient age information?
If a paper gives the precise age then this is entered and displayed in years to 2 decimal places in the sample overview page, for example COSS1735169. Less precise age information is added as a remark and displayed on the sample overview page as, for example, Age=Adult; Age=Child; Age=Elderly; Age=young adult; Age=more than 65 years; Age=Adult 20-60 years. For an example see COSS1757821. If the paper uses the term “paediatric” this is added as the remark Age=Child. In the past some papers reporting paediatric or adult leukaemias have had this information included in the Tumour remark section. This information is now included in the Individual Remark section as Age=Child or Age=Adult, as described above.
Has the whole gene been screened?
Not necessarily. Sometimes the entire coding sequence and the intron-exon boundaries of a gene will have been screened, but at other times only specific exons, codons or a specific single nucleotide change in 1 codon will have been analysed. This information is not visible in COSMIC but can be obtained from the original publication from which the data was extracted.
How can I tell what part of a gene has been screened?
Sometimes the entire coding sequence and the intron-exon boundaries of a gene will have been screened, but at other times only specific exons, codons or a specific single nucleotide change in 1 codon will have been analysed. This information is not visible in COSMIC but can be obtained from the original publication from which the data was extracted.
Is my gene fully curated in COSMIC? How are genes selected for manual curation in COSMIC?
As new cancer genes are identified from the literature these are added to the Cancer Gene Census list. A gene which is not currently in the manually curated Classic Gene List may be awaiting completion of the initial curation process, thus the data will not yet have been released; a gene may not have been confirmed as a true cancer gene according to our selection criteria and is awaiting more evidence; alternatively we may have missed the gene in question. We welcome suggestions for missing genes at firstname.lastname@example.org.
Why can’t I find a particular publication in COSMIC?
Publications are identified for manual curation of genes from the Classic Gene List by using weekly PubMed and PubCrawler searches. If data from a specific publication is missing it may have been missed from these searches or the paper may be awaiting curation, especially for some of the older well known cancer genes. Alternatively, the publication may be recorded in COSMIC but as a reference only if, for instance, the data was unclear or not presented in a format which was compatible with the COSMIC data entry system. We welcome suggestions concerning missing publications at email@example.com.
Is mutation data from cell lines included in COSMIC?
Cell lines are included in COSMIC if they have been screened for mutations. You might also want to check the COSMIC Cell Line Project, where the genetics and genomics of large numbers of cancer cell lines have been systematically characterised.
Are mutations analysed by immunohistochemistry included in COSMIC?
Mutations analysed solely by immunohistochemistry using mutation specific antibodies are not currently included in COSMIC.
Why does COSMIC contain data on overgrowth syndromes as they are not really cancer?
Somatic mutations detected in tissues associated with overgrowth syndromes such as Proteus and CLOVES syndromes are included in COSMIC. Not all somatic mutations give a growth advantage to the cells, but the mutations that have been identified in the context of these syndromes clearly do. Including these mutations in COSMIC will help us further define and understand cancer.
What does Inferred Breakpoint mean?
This is the genomic breakpoint for a gene fusion. For many fusions this is not reported in detail so it is necessary to infer the position based on the reported mRNA transcripts in a given sample. To do this, it is assumed that each sample's breakpoint lies between the most 3' expressed exon of the 5' gene and the most 5' exon of the 3' gene, from the mRNAs reported in that sample. However, if the genomic breakpoint position is reported in detail for the sample then this is input as the Inferred Breakpoint.
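The inference rule above can be illustrated with a small sketch. It is not COSMIC's implementation: it assumes each observed mRNA in a sample is summarised as a pair of exon numbers (the last exon retained from the 5' gene, the first exon retained from the 3' gene), numbered in transcript order, and the values shown are hypothetical.

```python
def infer_breakpoint_exons(observed_transcripts):
    """Return the exons that bound the inferred breakpoint for one sample.

    observed_transcripts: list of (last_exon_of_5prime_gene,
    first_exon_of_3prime_gene) pairs, one per mRNA reported in the sample.
    """
    if not observed_transcripts:
        raise ValueError("at least one observed transcript is required")
    # Most 3' expressed exon of the 5' gene: the highest exon number seen.
    five_prime_exon = max(pair[0] for pair in observed_transcripts)
    # Most 5' exon of the 3' gene: the lowest exon number seen.
    three_prime_exon = min(pair[1] for pair in observed_transcripts)
    return five_prime_exon, three_prime_exon

# Hypothetical sample with three alternatively spliced fusion transcripts.
transcripts = [(4, 7), (5, 8), (5, 6)]
bounds = infer_breakpoint_exons(transcripts)
print(bounds)
```

For this sample the breakpoint would be assumed to lie after exon 5 of the 5' gene and before exon 6 of the 3' gene.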
What does Observed mRNA transcript mean?
Many papers determine fusions between genes using expression technologies such as RT-PCR. A number of these studies have identified more than one transcript per sample, some finding over four different products between the same gene pair in one tumour. This implies significant alternative splicing of the mRNAs expressed from the fused gene pair. These alternative transcripts are input as Observed mRNA transcripts.
What are Related Breakpoints?
These are either all the Inferred Breakpoints for a selected mRNA transcript mutation, or all the Observed mRNA transcripts for a selected inferred breakpoint mutation.
What is a Translocation Name?
This is the syntax format describing the portions of mRNA present (in HGVS "r." format) from each gene in a fusion.
How is an inverted sequence annotated in a fusion?
An "o" before a gene name is used to indicate an inverted sequence, e.g.
Why can't I find any information in COSMIC on a particular gene fusion pair?
The curation of fusion data is on-going and the list of fusions currently curated in COSMIC can be found here. Sometimes an alternative transcript needs to be used to annotate a fusion so it may be necessary to search all transcripts for a gene to find any curated for fusions e.g. NOTCH1 and NOTCH1_ENST00000277541.
How do I calculate mutant frequencies?
Frequency (%) = Positives / (Positives + Negatives) x 100

However, whole genome screen data is not included.

File 2 lists samples with mutations (positives) from whole genome screens, but samples without mutations (negatives) are not included. The number of samples analysed by whole genome screening can be extracted from File 3 by selecting rows where the 'whole genome screen' column is equal to 'y'. Frequencies can be calculated as follows:

Frequency (%) = Positives in File 2 / Whole Genome Screen Samples in File 3 x 100

For the total dataset (targeted and whole genome screens), calculate the frequencies as follows:

Frequency (%) = (Positives in File 1 + Positives in File 2) / (Negatives in File 1 + Positives in File 1 + Whole Genome Screen Samples in File 3) x 100

To calculate frequencies for a specific tissue/histology of interest, the sample set must be restricted to those matching the tissue/histology classification. Please see the help section on Sample Counting for a description of how we count samples in COSMIC, and also the Mutation Frequency section on the same page.
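The three frequency calculations described above can be written as small helper functions. This is a minimal sketch: the function names are hypothetical, the inputs are counts of unique samples that you have already extracted from Files 1 to 3, and the example numbers are made up.

```python
def frequency(positives, screened):
    """Percentage of screened samples that carry a mutation."""
    if screened <= 0:
        raise ValueError("number of screened samples must be positive")
    return 100.0 * positives / screened

def targeted_frequency(pos_file1, neg_file1):
    # Targeted screens: positives and negatives both come from File 1.
    return frequency(pos_file1, pos_file1 + neg_file1)

def genome_screen_frequency(pos_file2, wgs_samples_file3):
    # Whole genome screens: the denominator is the File 3 sample count.
    return frequency(pos_file2, wgs_samples_file3)

def combined_frequency(pos_file1, neg_file1, pos_file2, wgs_samples_file3):
    # Total dataset: pool numerators and denominators from both screen types.
    return frequency(
        pos_file1 + pos_file2,
        neg_file1 + pos_file1 + wgs_samples_file3,
    )

# Made-up counts for one gene.
targeted = targeted_frequency(30, 70)
genome = genome_screen_frequency(5, 50)
combined = combined_frequency(30, 70, 5, 50)
print(targeted, genome, round(combined, 2))
```

For a tissue- or histology-specific frequency, the counts passed in would first need to be restricted to samples with the matching classification.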
What does the term 'Whole Genome Screen' mean?
We use the term 'Whole Genome Screen' to describe any study which has surveyed all genes in the genome, in contrast to a 'Targeted Screen' which surveys a smaller subset of genes. This term does not differentiate between whole genome sequencing and whole exome sequencing.
How are mutations mapped to gene sequences?
We attempt to map every mutation to a single version of a gene, but where this is not possible we map to an alternative transcript. The gene sequences are held in COSMIC and available in the download section.
What mutation detection method was employed?
Mutation screening methods differ in their sensitivity and the sensitivity of a particular method can vary from laboratory to laboratory. Some methods identify all classes of small intragenic mutation (base substitutions and small insertions/deletions). However, the protein truncation test will not detect mutations that cause missense amino acid substitutions.
Was the whole gene screened?
Some genes are characterised by mutation hot spots, for example BRAF, RAS and TP53. These genes are often screened for somatic mutations only in the region most likely to contain mutations. This strategy will obviously miss mutations located elsewhere in the gene and hence will provide a distorted view of the distribution of mutations in the gene and perhaps underestimate the frequency of mutations.
Has the sample been screened before?
There are examples where the same data is reported twice, perhaps in a follow-up study with reference to further data or as a positive control, for example using cell lines with known mutations. Where possible we have noted sample names and within papers have removed any redundancy. However, between papers it is not possible to confirm that two samples with the same name are indeed the same sample. We have therefore included both samples and both results in COSMIC. If you want to review this information, the sample name, mutation and paper reference are displayed in the Mutation Details view.
Are all the mutations real?
For many putative somatic mutations that have been reported in the published literature, definitive evidence that they are somatically acquired (through demonstration of their absence in normal DNA from the same individual as the tumour) is not available. Therefore, occasional germline variants may have inadvertently been represented in publications as somatic mutations and entered in the database. In addition, simple laboratory errors which result in an incorrect normal DNA sample (i.e. from a different individual) being analysed as a control for a particular tumour sample may provide apparently persuasive, but misleading, evidence of somatic origin. Finally, DNA amplification methods have an intrinsic error rate, and these errors may subsequently be interpreted as somatic mutations. There is some evidence that this may be a particular problem in analyses of archival formalin-fixed, paraffin embedded material.