Frequently Asked Questions (FAQ)
Can I download the full COSMIC dataset ?
Yes. The export sheets are updated with each COSMIC release. The file can be found here: CosmicMutantExport.tsv.gz. This file contains all the samples analysed for every gene in COSMIC found with mutations. For targeted gene screens there is also a file (CosmicCompleteTargetedScreensMutantExport.tsv.gz) containing negative data. Note that you will need to register and log in before you can download data.
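As a rough illustration, a downloaded export can be read without unpacking it first. The sketch below assumes only what the file name suggests, namely a gzip-compressed, tab-separated file with a header row; the function name iter_export_rows is hypothetical and no particular column names are assumed.

```python
import csv
import gzip
from typing import Dict, Iterator


def iter_export_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield each data row of a gzipped, tab-separated file as a dict keyed by header.

    Assumption: the first line of the file is a header row.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps stray quote characters in free-text fields intact.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        yield from reader
```

Rows are streamed one at a time, so a large export does not have to fit in memory, for example: for row in iter_export_rows("CosmicMutantExport.tsv.gz"): ...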
Where can I download all the mutations from COSMIC ?
All mutations are available in the export file, which can be found here: CosmicMutantExport.tsv.gz. This file contains all the samples analysed for every gene in COSMIC found with/without mutations.
How can I find the latest COSMIC release version ?
The COSMIC home page shows the version number and release date for the current, most recent COSMIC release. You can find more information about the release in the news and release notes sections, which can be found under the News menu on every page.
Which version of the human reference sequence does COSMIC use ?
From release v72 onwards we are using GRCh38/hg38 on the main COSMIC website but we also host an archive site at http://grch37-cancer.sanger.ac.uk. This archive site displays coordinates on the GRCh37/hg19 reference sequence. We will support this site by showing GRCh37 coordinates where they are available, but we will not be remapping new data generated on GRCh38.
How are samples counted in COSMIC ?
Each sample has its own name and ID. Multiple instances of the same sample name can exist as separate entries, indicating that it was unclear during curation that these samples were identical, apart from their name. To account for the duplication of probably identical samples during curation, we attempt to combine samples that have identical names and disease descriptions. Please see the help documentation for more details.
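The combining step described above can be pictured as grouping sample entries by name and disease description. This is only a sketch of the idea, not COSMIC's actual procedure: the function name and the (sample ID, name, disease description) tuple layout are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (sample_id, sample_name, disease_description)
Sample = Tuple[int, str, str]


def group_probable_duplicates(samples: Iterable[Sample]) -> Dict[Tuple[str, str], List[int]]:
    """Group sample IDs that share an identical name and disease description."""
    groups: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for sample_id, name, disease in samples:
        groups[(name, disease)].append(sample_id)
    return dict(groups)


# Made-up entries: IDs 1 and 2 share both name and disease, ID 3 shares only the name.
example = [
    (1, "SAMPLE_A", "lung carcinoma"),
    (2, "SAMPLE_A", "lung carcinoma"),
    (3, "SAMPLE_A", "melanoma"),
]
combined = group_probable_duplicates(example)
counted_samples = len(combined)  # three entries counted as two samples
```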
What does 'NS' mean in tissue/histology classifications ?
NS means 'not specified'.
Can I download data from older versions of COSMIC ?
Yes. We make data files available to download on our SFTP site for the current release and the three previous releases only, i.e. data files released more than one year ago will not be available.
I am preparing a manuscript for publication and I am including some COSMIC data. How should I cite COSMIC ?
We are very happy for you to use the data, and any tabulations or graphic screenshots which support your work. Please cite the website address (cancer.sanger.ac.uk) and the paper COSMIC: somatic cancer genetics at high-resolution. Thank you.
How often is COSMIC updated ?
COSMIC is updated once every three months; see the news item for what's in the new version.
What is the difference between cell line data in COSMIC and the Cell Line Project ?
The Cell Line Project is an in-depth analysis of over a thousand commonly used cancer cell lines, as defined here. However, many more cell lines have been examined in the literature for somatic mutations, and they are all recorded in the standard COSMIC database.
What are Census genes and where is the updated version of the census ?
The Cancer Gene Census is a list of genes known to be involved in cancer. They are listed here.
Can I submit data to COSMIC ?
All mutation data in COSMIC is currently entered by our curators. If you would like to submit data for one of your publications, or even pre-publication, please contact firstname.lastname@example.org and one of our curators will be happy to help.
How do I calculate gene mutation frequencies across multiple transcript variants ?
Gene mutation frequency counts are currently calculated on a per-transcript basis, rather than across gene loci. We have tried to maximise the coding annotations derived from large-scale studies providing genomic mutations, and this has resulted in the same genomic mutation being annotated to different splice variants of the same gene. An additional splice variant is only used in COSMIC if a mutation arises which maps to the coding domain of a gene, but not to any of the existing COSMIC transcripts. You therefore cannot add the prevalence counts of the splice variants together, as this could count the same genomic mutation more than once.
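One way to avoid the double counting described above is to count each genomic mutation once per sample, whichever transcript it was annotated to. The sketch below is illustrative only: the MutationRecord fields, the gene and transcript names, and the genomic change strings are all hypothetical and do not reflect the columns of any COSMIC file.

```python
from typing import Iterable, NamedTuple


class MutationRecord(NamedTuple):
    sample_id: str
    transcript: str       # hypothetical transcript label
    genomic_change: str   # hypothetical genomic description of the mutation


def count_unique_mutations(records: Iterable[MutationRecord]) -> int:
    """Count distinct (sample, genomic change) pairs, ignoring the transcript."""
    return len({(r.sample_id, r.genomic_change) for r in records})


def count_mutated_samples(records: Iterable[MutationRecord]) -> int:
    """Count samples carrying at least one mutation on any transcript."""
    return len({r.sample_id for r in records})


# Made-up data: sample S1 has one genomic mutation annotated on two transcripts.
records = [
    MutationRecord("S1", "GENE1", "chr1:g.100A>T"),
    MutationRecord("S1", "GENE1_ALT", "chr1:g.100A>T"),
    MutationRecord("S2", "GENE1", "chr1:g.250G>C"),
]
naive_total = len(records)                        # 3, counts S1's mutation twice
unique_total = count_unique_mutations(records)    # 2
mutated_samples = count_mutated_samples(records)  # 2
```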
What is the difference between census and classic genes ?
A census gene is one that is known to be involved in cancer. The list of these genes is used to prioritise the literature curation for the COSMIC database. Once the literature for a census gene has been completely curated, it is released and sometimes termed a 'COSMIC classic' gene.
There are no mutations in the full text of the paper. Are these extracted from the supplementary material ?
Yes. We utilize supplementary material for curation when it contains additional information.
What are the rules for mutation syntax in COSMIC ?
Mutations are annotated using syntax derived from HGVS nomenclature recommendations (see http://varnomen.hgvs.org).
How do I examine a histology or cancer type ?
COSMIC may use an alternative histology terminology, for example small cell carcinoma instead of neuroendocrine carcinoma (or, for some sites, neuroendocrine carcinoma instead of small cell carcinoma). More information about our classification system can be found at the URL below and all COSMIC tumour site and histology translations are available to view as an Excel spreadsheet or tab-delimited text file in the Classification documents found here. Note: You may also want to use our search to find the matching disease classification for the alternative terminologies.
How do I examine colon cancer ?
COSMIC uses Large Intestine for this cancer site. You can find information about our classification system and download all COSMIC tumour site and histology translations as an Excel spreadsheet or tab-delimited text file in the Classification documentation page.
How do I examine a tumour site ?
COSMIC may use an alternative site, e.g. Colon versus Large Intestine. More information about our classification system can be found at the URL below and all COSMIC tumour site and histology translations are available to view as an Excel spreadsheet or tab-delimited text file in the Classification documents found here.
Why is my search bringing back fewer records than expected ?
Check that no filter (displayed in the sidebar) is unexpectedly active and limiting the gene region, tumour type, site, etc. Some genes have been covered by systematic screens, so they have some data in COSMIC, but they have not yet been manually curated, so COSMIC holds less data for them than has been published. This may be because they are in the list of genes waiting to be manually curated, or because they are not included in the Cancer Gene Census.
Where can I find patient age information ?
If a paper gives the precise age then this is entered and displayed in years to 2 decimal places in the sample overview page, for example COSS1735169. Less precise age information is added as a remark and displayed on the sample overview page as, for example, Age=Adult; Age=Child; Age=Elderly; Age=young adult; Age=more than 65 years; Age=Adult 20-60 years. For an example see COSS1757821. If the paper uses the term “paediatric”, this is added as a remark Age=Child. In the past, some papers reporting paediatric or adult leukaemias have had this information included in the Tumour remark section. This information is now included in the Individual Remark section as Age=Child or Age=Adult, as described above.
Has the whole gene been screened ?
Not necessarily. Sometimes the entire coding sequence and the intron-exon boundaries of a gene will have been screened, but at other times only specific exons, codons or a specific single nucleotide change in 1 codon will have been analysed. This information is not visible in COSMIC but can be obtained from the original publication from which the data was extracted.
How can I tell what part of a gene has been screened ?
Sometimes the entire coding sequence and the intron-exon boundaries of a gene will have been screened, but at other times only specific exons, codons or a specific single nucleotide change in 1 codon will have been analysed. This information is not visible in COSMIC but can be obtained from the original publication from which the data was extracted.
Is my gene fully curated in COSMIC? How are genes selected for manual curation in COSMIC ?
As new cancer genes are identified from the literature these are added to the Cancer Gene Census list. A gene which is not currently in the manually curated Classic Gene List may be awaiting completion of the initial curation process, thus the data will not yet have been released; a gene may not have been confirmed as a true cancer gene according to our selection criteria and is awaiting more evidence; alternatively we may have missed the gene in question. We welcome suggestions for missing genes at email@example.com.
Why can’t I find a particular publication in COSMIC ?
Publications are identified for manual curation of genes from the Classic Gene List by using weekly PubMed and PubCrawler searches. If data from a specific publication is missing it may have been missed from these searches or the paper may be awaiting curation, especially for some of the older well known cancer genes. Alternatively, the publication may be recorded in COSMIC but as a reference only if, for instance, the data was unclear or not presented in a format which was compatible with the COSMIC data entry system. We welcome suggestions concerning missing publications at firstname.lastname@example.org.
Is mutation data from cell lines included in COSMIC ?
Cell lines are included in COSMIC if they have been screened for mutations. You might also want to check the COSMIC Cell Line Project, where the genetics and genomics of large numbers of cancer cell lines have been systematically characterised.
Are mutations analysed by immunohistochemistry included in COSMIC ?
Mutations analysed solely by immunohistochemistry using mutation specific antibodies are not currently included in COSMIC.
Why does COSMIC contain data on overgrowth syndromes as they are not really cancer ?
Somatic mutations detected in tissues associated with overgrowth syndromes such as Proteus and CLOVES syndromes are included in COSMIC. Not all somatic mutations give a growth advantage to the cells, but the mutations that have been identified in the context of these syndromes clearly do. Including these mutations in COSMIC will help us further define and understand cancer.
What does Inferred Breakpoint mean ?
This is the genomic breakpoint for a gene fusion. For many fusions this is not reported in detail so it is necessary to infer the position based on the reported mRNA transcripts in a given sample. To do this, it is assumed that each sample's breakpoint lies between the most 3' expressed exon of the 5' gene and the most 5' exon of the 3' gene, from the mRNAs reported in that sample. However, if the genomic breakpoint position is reported in detail for the sample then this is input as the Inferred Breakpoint.
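The inference rule above can be sketched in terms of exon numbers. This is a simplified illustration under stated assumptions, not COSMIC's implementation: each observed mRNA is reduced to the last exon retained from the 5' gene and the first exon retained from the 3' gene, and the names ObservedTranscript and infer_breakpoint_window are hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple


class ObservedTranscript(NamedTuple):
    last_exon_5p_gene: int   # last exon of the 5' gene present in this mRNA
    first_exon_3p_gene: int  # first exon of the 3' gene present in this mRNA


def infer_breakpoint_window(transcripts: Iterable[ObservedTranscript]) -> Tuple[int, int]:
    """Return (most 3' expressed exon of the 5' gene, most 5' exon of the 3' gene).

    The breakpoint is assumed to lie after the first exon number returned
    and before the second, across all mRNAs reported for one sample.
    """
    transcripts = list(transcripts)
    if not transcripts:
        raise ValueError("at least one observed transcript is required")
    most_3p_exon_of_5p_gene = max(t.last_exon_5p_gene for t in transcripts)
    most_5p_exon_of_3p_gene = min(t.first_exon_3p_gene for t in transcripts)
    return most_3p_exon_of_5p_gene, most_5p_exon_of_3p_gene


# Made-up sample with three alternatively spliced fusion transcripts.
window = infer_breakpoint_window([
    ObservedTranscript(13, 2),
    ObservedTranscript(14, 2),
    ObservedTranscript(13, 3),
])
```

For this made-up sample the breakpoint would be placed after exon 14 of the 5' gene and before exon 2 of the 3' gene.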
What does Observed mRNA transcript mean ?
Many papers determine fusions between genes using expression technologies such as RT-PCR. A number of these studies have identified more than one transcript per sample, some finding over four different products between the same gene pair in one tumour. This implies significant alternative splicing of the mRNAs expressed from the fused gene pair. These alternative transcripts are input as Observed mRNA transcripts.
What are Related Breakpoints ?
These are either all the Inferred Breakpoints for a selected mRNA transcript mutation, or all the Observed mRNA transcripts for a selected inferred breakpoint mutation.
What is a Translocation Name ?
This is the syntax format describing the portions of mRNA present (in HGVS "r." format) from each gene in a fusion.
How is an inverted sequence annotated in a fusion ?
An "o" before a gene name is used to indicate an inverted sequence, e.g.
Why can't I find any information in COSMIC on a particular gene fusion pair ?
The curation of fusion data is on-going and the list of fusions currently curated in COSMIC can be found here. Sometimes an alternative transcript needs to be used to annotate a fusion so it may be necessary to search all transcripts for a gene to find any curated for fusions e.g. NOTCH1 and NOTCH1_ENST00000277541.
How do I calculate mutant frequencies ?
For targeted screens (File 1), frequencies can be calculated as follows:

$$\frac{\text{Positives}}{\text{Positives} + \text{Negatives}} \times 100$$

However, whole genome screen data is not included in File 1. File 2 lists samples with mutations (positives) from whole genome screens, but samples without mutations (negatives) are not included. The number of samples analysed by whole genome screening can be extracted from File 3 by selecting rows where the 'whole genome screen' column is equal to 'y'. Frequencies can be calculated as follows:

$$\frac{\text{Positives in File 2}}{\text{Whole Genome Screen Samples in File 3}} \times 100$$

For the total dataset (targeted and whole genome screens), calculate the frequencies as follows:

$$\frac{\text{Positives in File 1} + \text{Positives in File 2}}{\text{Negatives in File 1} + \text{Positives in File 1} + \text{Whole Genome Screen Samples in File 3}} \times 100$$

To calculate frequencies for a specific tissue/histology of interest, the sample set must be restricted to those matching the tissue/histology classification. Please see the help section on Sample Counting for a description of how we count samples in COSMIC, and also the Mutation Frequency section on the same page.
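The three frequency calculations above can be written as small functions. This is a minimal sketch that assumes the positive, negative and whole genome screen sample counts have already been extracted from Files 1, 2 and 3; the function names are hypothetical and the example counts are made up.

```python
def percentage(positives: int, total_screened: int) -> float:
    """Return positives as a percentage of all screened samples."""
    if total_screened <= 0:
        raise ValueError("total_screened must be greater than zero")
    return 100.0 * positives / total_screened


def targeted_frequency(pos_file1: int, neg_file1: int) -> float:
    """Targeted screens: positives / (positives + negatives) x 100."""
    return percentage(pos_file1, pos_file1 + neg_file1)


def genome_screen_frequency(pos_file2: int, wgs_samples_file3: int) -> float:
    """Whole genome screens: File 2 positives / File 3 screened samples x 100."""
    return percentage(pos_file2, wgs_samples_file3)


def combined_frequency(pos_file1: int, neg_file1: int,
                       pos_file2: int, wgs_samples_file3: int) -> float:
    """Total dataset: all positives / all screened samples x 100."""
    return percentage(pos_file1 + pos_file2,
                      neg_file1 + pos_file1 + wgs_samples_file3)


# Made-up counts for illustration only.
targeted = targeted_frequency(pos_file1=30, neg_file1=70)
genome = genome_screen_frequency(pos_file2=5, wgs_samples_file3=100)
combined = combined_frequency(30, 70, 5, 100)
```

With these made-up counts the targeted frequency is 30%, the whole genome screen frequency is 5%, and the combined frequency is 17.5%. Note that the combined value is not the average of the other two in general, because the denominators differ.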
What does the term 'Whole Genome Screen' mean ?
We use the term 'Whole Genome Screen' to describe any study which has surveyed all genes in the genome, in contrast to a 'Targeted Screen' which surveys a smaller subset of genes. This term does not differentiate between whole genome sequencing and whole exome sequencing.