Data Downloads (release v87, 13th November 2018)

This page allows you to download the various COSMIC data files. It also has descriptions of the data contained in each file.

You will need to log in to download the files. As part of COSMIC's growth and development plan, we have implemented a licensing strategy. Everyone is required to register in order to download data files, but only non-academic organisations need to pay a license fee. More information can be found on our licensing page.

Whole File Downloads

To download a complete file, simply click on the dark blue 'Download Whole File' button for the file that you require and your download will begin.

Filtered File Downloads

Some files can be filtered by any combination of gene, sample or cancer type:

  • click on the blue 'Download Filtered File' button to show the filter fields
  • fill in the filters that you require
  • as you type, look in the drop-down list for the gene, sample or cancer type that you need
  • the field will turn green if the filter matches something in the COSMIC database or red otherwise
  • click 'Download' to retrieve the filtered data

Scripted Downloads

You can download files programmatically. Click the purple 'Scripted download' button next to each file for information on how to retrieve that file via the command line or a script. All files for the current and past 6 versions of COSMIC are available for download. Check out our help pages for more information on downloading, and for an explanation of how to find a manifest for all available files.

Download a sample of COSMIC data

We have made the first 100 lines of each of the download files freely available so you can try out the data. More information can be found on our about page.

Classification Information


A comma separated table of COSMIC cancer classification information.

classification.csv

File Description

[column number:label] Heading

[1:A] Cosmic_Phenotype_id - Unique COSMIC identifier for the classification.

[2:B] Site_Primary - Primary tissue specified in the publication.

[3:C] Site_Subtype1 - Sub tissue specified in the publication.

[4:D] Site_Subtype2 - Sub tissue specified in the publication.

[5:E] Site_Subtype3 - Sub tissue specified in the publication.

[6:F] Histology - Primary histology specified in the publication.

[7:G] Hist_Subtype1 - Sub histology specified in the publication.

[8:H] Hist_Subtype2 - Sub histology specified in the publication.

[9:I] Hist_Subtype3 - Sub histology specified in the publication.

[10:J] Site_Primary_COSMIC - Primary tissue specified in COSMIC.

[11:K] Site_Subtype1_COSMIC - Sub tissue specified in COSMIC.

[12:L] Site_Subtype2_COSMIC - Sub tissue specified in COSMIC.

[13:M] Site_Subtype3_COSMIC - Sub tissue specified in COSMIC.

[14:N] Histology_COSMIC - Primary histology specified in COSMIC.

[15:O] Hist_Subtype1_COSMIC - Sub histology specified in COSMIC.

[16:P] Hist_Subtype2_COSMIC - Sub histology specified in COSMIC.

[17:Q] Hist_Subtype3_COSMIC - Sub histology specified in COSMIC.

[18:R] NCI code - NCI thesaurus code for tumour histological classification. For details see https://ncit.nci.nih.gov

[19:S] EFO code - Experimental Factor Ontology (EFO), for details see here
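As an illustration, the classification table could be loaded and indexed by its identifier column. This is a minimal sketch, assuming the file's header row uses the column labels exactly as listed above (the spelling in the real file may differ); `load_classification` and the sample row are hypothetical.

```python
import csv
import io


def load_classification(handle):
    """Return a dict mapping Cosmic_Phenotype_id to its full row."""
    reader = csv.DictReader(handle)
    return {row["Cosmic_Phenotype_id"]: row for row in reader}


# Made-up sample standing in for classification.csv;
# only a few of the 19 columns are shown.
sample = io.StringIO(
    "Cosmic_Phenotype_id,Site_Primary,Histology,Site_Primary_COSMIC,Histology_COSMIC\n"
    "EXAMPLE_ID,lung,carcinoma,lung,carcinoma\n"
)
index = load_classification(sample)
print(index["EXAMPLE_ID"]["Site_Primary_COSMIC"])
```

Indexing by the identifier makes it straightforward to compare the classification given in the publication with the one assigned in COSMIC for the same row.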

COSMIC Complete Mutation Data (Targeted Screens)


A tab separated table of the complete curated COSMIC dataset (targeted screens) from the current release. It includes all coding point mutations, and the negative data set.

CosmicCompleteTargetedScreensMutantExport.tsv.gz

File Description

[column number:label] Heading

[1:A] Gene name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC symbol.

[2:B] Accession Number - The transcript identifier of the gene.

[3:C] Gene CDS length - Length of the gene's coding sequence (CDS), in base pairs.

[4:D] HGNC id - Unique HGNC identifier, if the gene is in HGNC.

[5:E] Sample name,Sample id,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[8:H] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. COSMIC uses a standard classification system for tissue types and sub types because they vary a lot between different papers.

[9:I] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[10:J] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[11:K] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[12:L] Primary Histology - The histological classification of the sample.

[13:M] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[14:N] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[15:O] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[16:P] Genome-wide screen - if the entire genome/exome is sequenced.

[17:Q] Mutation Id - unique mutation identifier.

[18:R] Mutation CDS - The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.

[19:S] Mutation AA - The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society. The description of each type can be found by following the link to Mutation Overview page.

[20:T] Mutation Description - Type of mutation at the amino acid level (substitution, deletion, insertion, complex, fusion, unknown etc.)

[21:U] Mutation zygosity - Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.

[22:V] LOH - Information on whether the gene was reported to have loss of heterozygosity in the sample: yes, no or unknown.

[23:W] GRCh - The coordinate system used -

    37 = GRCh37/Hg19
    38 = GRCh38/Hg38

[24:X] Mutation genome position - The genomic coordinates of the mutation.

[25:Y] Mutation strand - Positive or negative.

[26:Z] SNP - Known SNPs are flagged as 'y'. SNPs are defined using the 1000 Genomes Project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing.

[27:AA] Resistance Mutation - The mutation confers drug resistance (see CosmicResistanceMutations.tsv.gz for gene/drug details).

[28:AB] FATHMM prediction - More information about FATHMM (Functional Analysis through Hidden Markov Models) is available from here. FATHMM descriptors -

    Neutral = Defined as Passenger or Tolerated.
    Pathogenic = Defined as Cancer or Damaging.

[29:AC] FATHMM Score - The scores are p-values ranging from 0 to 1. Scores greater than 0.5 are pathogenic, while scores less than 0.5 are benign. Scores close to 0 or 1 are the high-confidence results, which are more accurate. The results are annotated as 10 feature groups (separately for coding and non-coding variants), details of which can be found in the original FATHMM-MKL paper.

[30:AD] Mutation somatic status - Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin -

    Previously observed = when the mutation has been reported as somatic previously but not in current paper.
    variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.
    Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.

[31:AE] Pubmed_PMID - The PUBMED ID for the paper that the sample was noted in, linking to pubmed to provide more details of the publication.

[32:AF] Id Study - Lists the unique Ids of studies that have involved this sample.

[33:AG] Sample Type,Tumour origin - Describes where the sample has originated from including the tumour type.

[35:AI] Age - Age of the individual (if this information is provided with the publications).
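The column numbers above can be used to pull fields out of the export by position. The following is a minimal sketch, assuming the file is tab separated with a single header line and 35 columns as described; `open_export`, `genes_for_site`, `make_row` and the sample rows are hypothetical names and data used only for illustration.

```python
import csv
import gzip
import io

# 1-based column numbers taken from the file description above.
GENE_NAME = 1
PRIMARY_SITE = 8
TOTAL_COLUMNS = 35


def open_export(path):
    """Open a gzip-compressed export as text."""
    return gzip.open(path, "rt", newline="")


def genes_for_site(handle, site):
    """Collect gene names from rows whose Primary Site equals `site`."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # assumption: the first line is a header row
    genes = set()
    for row in reader:
        if len(row) >= PRIMARY_SITE and row[PRIMARY_SITE - 1] == site:
            genes.add(row[GENE_NAME - 1])
    return genes


def make_row(gene, site):
    """Build a made-up 35-column row for demonstration."""
    row = [""] * TOTAL_COLUMNS
    row[GENE_NAME - 1] = gene
    row[PRIMARY_SITE - 1] = site
    return "\t".join(row)


sample = io.StringIO("\n".join([
    make_row("Gene name", "Primary site"),  # stand-in header line
    make_row("GENE_A", "lung"),
    make_row("GENE_B", "skin"),
    make_row("GENE_A", "lung"),
]) + "\n")
print(genes_for_site(sample, "lung"))
```

For a downloaded file, the handle returned by `open_export("CosmicCompleteTargetedScreensMutantExport.tsv.gz")` could be passed in place of the in-memory sample. Reading row by row avoids loading the whole export into memory.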

COSMIC Mutation Data (Genome Screens)


A tab separated table of coding point mutations from genome wide screens (including whole exome sequencing).

CosmicGenomeScreensMutantExport.tsv.gz

File Description

[column number:label] Heading

[1:A] Gene name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC symbol.

[2:B] Accession Number - The transcript identifier of the gene.

[3:C] Gene CDS length - Length of the gene's coding sequence (CDS), in base pairs.

[4:D] HGNC id - Unique HGNC identifier, if the gene is in HGNC.

[5:E] Sample name,Sample id,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[8:H] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. COSMIC uses a standard classification system for tissue types and sub types because they vary a lot between different papers.

[9:I] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[10:J] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[11:K] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[12:L] Primary Histology - The histological classification of the sample.

[13:M] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[14:N] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[15:O] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[16:P] Genome-wide screen - if the entire genome/exome is sequenced.

[17:Q] Mutation Id - unique mutation identifier.

[18:R] Mutation CDS - The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.

[19:S] Mutation AA - The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society. The description of each type can be found by following the link to Mutation Overview page.

[20:T] Mutation Description - Type of mutation at the amino acid level (substitution, deletion, insertion, complex, fusion, unknown etc.)

[21:U] Mutation zygosity - Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.

[22:V] LOH - Information on whether the gene was reported to have loss of heterozygosity in the sample: yes, no or unknown.

[23:W] GRCh - The coordinate system used -

    38 = GRCh38/Hg38
    37 = GRCh37/Hg19

[24:X] Mutation genome position - The genomic coordinates of the mutation.

[25:Y] Mutation strand - positive or negative.

[26:Z] SNP - Known SNPs are flagged as 'y'. SNPs are defined using the 1000 Genomes Project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing.

[27:AA] FATHMM prediction - More information about FATHMM (Functional Analysis through Hidden Markov Models) is available from here. FATHMM descriptors -

    Pathogenic = Defined as Cancer or Damaging.
    Neutral = Defined as Passenger or Tolerated.

[28:AB] FATHMM Score - The scores are p-values ranging from 0 to 1. Scores greater than 0.5 are pathogenic, while scores less than 0.5 are benign. Scores close to 0 or 1 are the high-confidence results, which are more accurate. The results are annotated as 10 feature groups (separately for coding and non-coding variants), details of which can be found in the original FATHMM-MKL paper.

[29:AC] Mutation somatic status - Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin -

    variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.
    Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.
    Previously observed = when the mutation has been reported as somatic previously but not in current paper.

[30:AD] Pubmed_PMID - The PUBMED ID for the paper that the sample was noted in, linking to pubmed to provide more details of the publication.

[31:AE] Id Study - Lists the unique Ids of studies that have involved this sample.

[32:AF] Sample Type,Tumour origin - Describes where the sample has originated from including the tumour type.

[34:AH] Age - Age of the individual (if this information is provided with the publications).

COSMIC Mutation Data


A tab separated table of all COSMIC coding point mutations from targeted and genome wide screens from the current release.

CosmicMutantExport.tsv.gz

File Description

[column number:label] Heading

[1:A] Gene name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC symbol.

[2:B] Accession Number - The transcript identifier of the gene.

[3:C] Gene CDS length - Length of the gene's coding sequence (CDS), in base pairs.

[4:D] HGNC id - Unique HGNC identifier, if the gene is in HGNC.

[5:E] Sample name,Sample id,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[8:H] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. COSMIC uses a standard classification system for tissue types and sub types because they vary a lot between different papers.

[9:I] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[10:J] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[11:K] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[12:L] Primary Histology - The histological classification of the sample.

[13:M] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[14:N] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[15:O] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[16:P] Genome-wide screen - if the entire genome/exome is sequenced.

[17:Q] Mutation Id - unique mutation identifier.

[18:R] Mutation CDS - The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.

[19:S] Mutation AA - The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society. The description of each type can be found by following the link to Mutation Overview page.

[20:T] Mutation Description - Type of mutation at the amino acid level (substitution, deletion, insertion, complex, fusion, unknown etc.)

[21:U] Mutation zygosity - Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.

[22:V] LOH - Information on whether the gene was reported to have loss of heterozygosity in the sample: yes, no or unknown.

[23:W] GRCh - The coordinate system used -

    38 = GRCh38/Hg38
    37 = GRCh37/Hg19

[24:X] Mutation genome position - The genomic coordinates of the mutation.

[25:Y] Mutation strand - positive or negative.

[26:Z] SNP - Known SNPs are flagged as 'y'. SNPs are defined using the 1000 Genomes Project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing.

[27:AA] Resistance Mutation - The mutation confers drug resistance (see CosmicResistanceMutations.tsv.gz for gene/drug details).

[28:AB] FATHMM prediction - More information about FATHMM (Functional Analysis through Hidden Markov Models) is available from here. FATHMM descriptors -

    Pathogenic = Defined as Cancer or Damaging.
    Neutral = Defined as Passenger or Tolerated.

[29:AC] FATHMM Score - The scores are p-values ranging from 0 to 1. Scores greater than 0.5 are pathogenic, while scores less than 0.5 are benign. Scores close to 0 or 1 are the high-confidence results, which are more accurate. The results are annotated as 10 feature groups (separately for coding and non-coding variants), details of which can be found in the original FATHMM-MKL paper.

[30:AD] Mutation somatic status - Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin -

    Previously observed = when the mutation has been reported as somatic previously but not in current paper.
    Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.
    Variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.

[31:AE] Pubmed_PMID - The PUBMED ID for the paper that the sample was noted in, linking to pubmed to provide more details of the publication.

[32:AF] Id Study - Lists the unique Ids of studies that have involved this sample.

[33:AG] Sample Type,Tumour origin - Describes where the sample has originated from including the tumour type.

[35:AI] Age - Age of the individual (if this information is provided with the publications).

Structural Genomic Rearrangements


All structural variants from the current release in a tab separated table.

CosmicStructExport.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample name,Sample id,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[4:D] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. COSMIC uses a standard classification system for tissue types and sub types because they vary a lot between different papers.

[5:E] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[6:F] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[7:G] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[8:H] Primary Histology - The histological classification of the sample.

[9:I] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[10:J] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[11:K] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[12:L] Mutation Id - unique mutation identifier.

[13:M] Mutation Type - Type of mutation: Intra/Inter (chromosomal), tandem duplication, deletion, inversion, complex substitutions, complex amplicons.

[14:N] GRCh - The coordinate system used -

    38 = GRCh38/Hg38
    37 = GRCh37/Hg19

[15:O] Description - A syntax which describes the structural variant, based on HGVS recommendations.

[16:P] Pubmed_PMID - The PUBMED ID for the paper that the sample was noted in.

[17:Q] ID_STUDY - Lists the unique Ids of studies that have involved this structural mutation.


All breakpoint data from the current release in a tab separated table.

CosmicBreakpointsExport.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample name,Sample id,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[4:D] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. COSMIC uses a standard classification system for tissue types and sub types because they vary a lot between different papers.

[5:E] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[6:F] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[7:G] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[8:H] Primary Histology - The histological classification of the sample.

[9:I] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[10:J] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[11:K] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[12:L] Mutation Type - Type of mutation: Intra/Inter (chromosomal), tandem duplication, deletion, inversion, complex substitutions, complex amplicons.

[13:M] Mutation Id - unique mutation identifier.

[14:N] Breakpoint Order - For variants involving multiple breakpoints, the predicted order along chromosome(s). Otherwise '0'.

[15:O] GRCh - The coordinate system used -

    37 = GRCh37/Hg19
    38 = GRCh38/Hg38

[16:P] Chrom From - The chromosome where the first variant/breakpoint occurs.

[17:Q] Location From min - The first position in breakpoint range.

[18:R] Location From max - The last position in breakpoint range.

[19:S] Strand From - positive or negative.

[20:T] Chrom To - The chromosome where the last variant/breakpoint occurs.

[21:U] Location To min - The first position in breakpoint range.

[22:V] Location To max - The last position in breakpoint range.

[23:W] Strand To - positive or negative.

[24:X] Non-templated ins seq - Non Templated Sequence (if any) which is inserted at the breakpoint. The sequence is not encoded.

[25:Y] Pubmed_PMID - The PUBMED ID for the paper that the sample was noted in.

[26:Z] Id Study - Lists the unique Ids of studies that have involved this structural mutation.
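The 'from' and 'to' columns can be combined to tell whether a breakpoint pair joins two different chromosomes and how wide each breakpoint range is. This is a minimal sketch, assuming each row has already been split on tabs into 26 fields and that the min/max positions are inclusive integers; the `Breakpoint` class, `parse_breakpoint` and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Breakpoint:
    chrom_from: str
    from_min: int
    from_max: int
    chrom_to: str
    to_min: int
    to_max: int

    @property
    def is_interchromosomal(self):
        """True when the two ends lie on different chromosomes."""
        return self.chrom_from != self.chrom_to

    @property
    def from_width(self):
        """Positions spanned by the 'from' range (assumed inclusive)."""
        return self.from_max - self.from_min + 1


def parse_breakpoint(fields):
    """Build a Breakpoint from one tab-split row of the export."""
    return Breakpoint(
        chrom_from=fields[15],     # [16:P] Chrom From
        from_min=int(fields[16]),  # [17:Q] Location From min
        from_max=int(fields[17]),  # [18:R] Location From max
        chrom_to=fields[19],       # [20:T] Chrom To
        to_min=int(fields[20]),    # [21:U] Location To min
        to_max=int(fields[21]),    # [22:V] Location To max
    )


# Made-up row: only the coordinate columns are filled in.
fields = [""] * 26
fields[15], fields[16], fields[17] = "1", "1000", "1010"
fields[19], fields[20], fields[21] = "7", "5000", "5000"

bp = parse_breakpoint(fields)
print(bp.is_interchromosomal, bp.from_width)
```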

Complete Fusion Export


All gene fusion mutation data from the current release in a tab separated table.

CosmicFusionExport.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample id,Sample name - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[3:C] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. COSMIC uses a standard classification system for tissue types and sub types because they vary a lot between different papers.

[4:D] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[5:E] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[6:F] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[7:G] Primary Histology - The histological classification of the sample.

[8:H] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[9:I] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[10:J] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[11:K] Fusion Id - Unique fusion mutation identifier.

[12:L] Translocation Name - Syntax describing the portions of mRNA present (in HGVS 'r.' format) from each gene (allows representation of UTR sequences).

[13:M] Fusion type - Type of mutation.

[14:N] Pubmed_PMID - The PUBMED ID for the paper that the sample was noted in.

[15:O] Id Study - Lists the unique Ids of studies that have involved this fusion mutation.
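One simple summary of this file is the number of distinct fusions seen per primary site. The sketch below assumes a tab-separated file with one header line and 15 columns as listed above, and that rows without a fusion have an empty Fusion Id; that last point is an assumption, as are the `fusions_per_site` and `make_row` helpers and the sample data.

```python
import csv
import io
from collections import defaultdict

# 1-based column numbers taken from the file description above.
PRIMARY_SITE = 3
FUSION_ID = 11
TOTAL_COLUMNS = 15


def fusions_per_site(handle):
    """Count distinct Fusion Ids for each Primary Site."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # assumption: the first line is a header row
    seen = defaultdict(set)
    for row in reader:
        if len(row) < FUSION_ID:
            continue  # skip short or malformed lines
        fusion_id = row[FUSION_ID - 1]
        if fusion_id:  # assumption: empty means no fusion recorded
            seen[row[PRIMARY_SITE - 1]].add(fusion_id)
    return {site: len(ids) for site, ids in seen.items()}


def make_row(site, fusion_id):
    """Build a made-up 15-column row for demonstration."""
    row = [""] * TOTAL_COLUMNS
    row[PRIMARY_SITE - 1] = site
    row[FUSION_ID - 1] = fusion_id
    return "\t".join(row)


sample = io.StringIO("\n".join([
    make_row("Primary site", "Fusion id"),  # stand-in header line
    make_row("lung", "F1"),
    make_row("lung", "F1"),      # same fusion in another sample
    make_row("lung", "F2"),
    make_row("thyroid", "F3"),
    make_row("thyroid", ""),     # no fusion recorded
]) + "\n")
counts = fusions_per_site(sample)
print(counts)
```

Using a set per site means a fusion observed in several samples is counted once.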

All Mutations in Census Genes


All coding mutations in genes listed in the Cancer Gene Census ( http://cancer.sanger.ac.uk/census ) in a tab separated table.

CosmicMutantExportCensus.tsv.gz

File Description

[column number:label] Heading

[1:A] Gene name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC symbol.

[2:B] Accession Number - The transcript identifier of the gene.

[3:C] Gene CDS length - Length of the gene's coding sequence (CDS), in base pairs.

[4:D] HGNC id - Unique HGNC identifier, if the gene is in HGNC.

[5:E] Sample name,Sample id,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[8:H] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. COSMIC uses a standard classification system for tissue types and sub types because they vary a lot between different papers.

[9:I] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[10:J] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[11:K] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[12:L] Primary Histology - The histological classification of the sample.

[13:M] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[14:N] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[15:O] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[16:P] Genome-wide screen - if the entire genome/exome is sequenced.

[17:Q] Mutation Id - unique mutation identifier.

[18:R] Mutation CDS - The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.

[19:S] Mutation AA - The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society. The description of each type can be found by following the link to Mutation Overview page.

[20:T] Mutation Description - Type of mutation (substitution, deletion, insertion, complex, fusion etc.)

[21:U] Mutation zygosity - Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.

[22:V] LOH - Information on whether the gene was reported to have loss of heterozygosity in the sample: yes, no or unknown.

[23:W] GRCh - The coordinate system used -

    38 = GRCh38/Hg38
    37 = GRCh37/Hg19

[24:X] Mutation genome position - The genomic coordinates of the mutation.

[25:Y] Mutation strand - positive or negative.

[26:Z] SNP - Known SNPs are flagged as 'y'. SNPs are defined using the 1000 Genomes Project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing.

[27:AA] Resistance Mutation - mutation confers drug resistance (see CosmicResistanceMutations.tsv.gz for gene/drug details).

[28:AB] FATHMM prediction - More information about FATHMM (Functional Analysis through Hidden Markov Models) is available from here. FATHMM descriptors -

    Neutral = Defined as Passenger or Tolerated.
    Pathogenic = Defined as Cancer or Damaging.

[29:AC] FATHMM score - The FATHMM-MKL functional score is a p-value, ranging from 0 to 1. Scores above 0.5 are deleterious, but in order to highlight the most significant data in COSMIC, only scores >= 0.7 are classified as 'Pathogenic'. Mutations are classed as 'Neutral' if the score is <= 0.5.

[30:AD] Mutation somatic status - Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin -

    Variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.
    Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.
    Previously observed = when the mutation has been reported as somatic previously but not in current paper.

[31:AE] Pubmed_PMID - The PUBMED ID for the paper that the sample was noted in, linking to pubmed to provide more details of the publication.

[32:AF] Id Study - Lists the unique Ids of studies that have involved this sample.

[33:AG] Sample Type,Tumour origin - Describes where the sample has originated from including the tumour type.

[35:AI] Age - Age of the individual (if this information is provided with the publications).

[36:AJ] Tier - 1 or 2 [see here for details of Tier 1 and 2]
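The score thresholds given for the FATHMM score column [29:AC] of this file can be written as a small function. This is a sketch of the rule as stated (scores >= 0.7 are 'Pathogenic', scores <= 0.5 are 'Neutral'); the description does not say how scores between 0.5 and 0.7 are labelled, so returning `None` for them is an assumption, as is the function name `classify_fathmm`.

```python
def classify_fathmm(score):
    """Map a FATHMM-MKL score in [0, 1] to the label described above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.7:
        return "Pathogenic"
    if score <= 0.5:
        return "Neutral"
    # Assumption: scores strictly between 0.5 and 0.7 get no label.
    return None


for value in (0.95, 0.7, 0.6, 0.5, 0.1):
    print(value, classify_fathmm(value))
```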

Non coding variants


A tab separated table of all non-coding mutations from the current release.

CosmicNCV.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample name,Sample id,Tumour id - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[4:D] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.

[5:E] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[6:F] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[7:G] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[8:H] Primary Histology - The histological classification of the sample.

[9:I] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[10:J] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[11:K] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[12:L] Id NCV - Unique non-coding variant identifier.

[13:M] Zygosity - Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.

[14:N] GRCh - The coordinate system used -

    38 = GRCh38/Hg38
    37 = GRCh37/Hg19

[15:O] Genome position - The genomic coordinate of the mutation.

[16:P] Mutation somatic status - Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin -

    Variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.
    Previously observed = when the mutation has been reported as somatic previously but not in current paper.
    Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.

[17:Q] WT SEQ - Wild type sequence.

[18:R] MUT SEQ - Mutated sequence.

[19:S] SNP - Known SNPs are flagged as 'y'. SNPs are defined by the 1000 genomes project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing.

[20:T] FATHMM_MKL_NON_CODING_SCORE - FATHMM-MKL non-coding score. A p-value ranging from 0 to 1 where >= 0.7 is functionally significant.

[21:U] FATHMM_MKL_NON_CODING_GROUPS - FATHMM-MKL group classification. More details from here.

[22:V] FATHMM_MKL_CODING_SCORE - FATHMM-MKL coding score (p-value ranging from 0 to 1).

[23:W] FATHMM_MKL_CODING_GROUPS - FATHMM-MKL group classification (coding). More details from here.

[24:X] Whole Genome Reseq - If the entire genome is sequenced.

[25:Y] Whole_Exome - If the entire exome is sequenced.

[26:Z] Id Study - Lists the unique Ids of studies that have involved this non coding mutation.

[27:AA] Pubmed_PMID - The PUBMED ID for the paper that the sample was noted in.
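Because the column numbers above are 1-based, they can be used directly to pick fields out of the file. The following Python sketch streams a gzip-compressed, tab separated file and keeps rows whose FATHMM_MKL_NON_CODING_SCORE (column 20) is >= 0.7. The function name is hypothetical, and the sketch assumes the first line of the file is a header row; check this against the file you download.

```python
import csv
import gzip


def significant_ncvs(stream, score_col=20, threshold=0.7):
    """Yield rows whose FATHMM-MKL non-coding score is >= threshold.

    `stream` is a binary file object holding gzip-compressed TSV data.
    `score_col` is the 1-based column number from the file description.
    """
    with gzip.open(stream, mode="rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # Assumption: the first line is a header row.
        for row in reader:
            value = row[score_col - 1] if len(row) >= score_col else ""
            try:
                score = float(value)
            except ValueError:
                continue  # Skip rows with a missing or non-numeric score.
            if score >= threshold:
                yield row
```

A typical call would open the downloaded file in binary mode, for example `with open("CosmicNCV.tsv.gz", "rb") as fh:` followed by a loop over `significant_ncvs(fh)`.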

Copy Number Variants


All copy number aberrations from the current release in a tab separated table. For more information on copy number data, please see http://cancer.sanger.ac.uk/cosmic/help/cnv/overview.

CosmicCompleteCNA.tsv.gz

File Description

[column number:label] Heading

[1:A] CNV_ID - The unique identifier for the variant (not stable, differs between releases).

[2:B] Id gene,Gene name - The ID and symbol of the gene which overlaps the copy number segment (or '-' where there is no overlapping gene).

[4:D] Sample id,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the ICGC and TCGA.

[6:F] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.

[7:G] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[8:H] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[9:I] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[10:J] Primary Histology - The histological classification of the sample.

[11:K] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[12:L] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[13:M] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[14:N] Sample Name - The name of the sample.

[15:O] Total_CN - The sum of the major and minor allele counts, e.g. if ABB, total copy number = 3.

[16:P] Minor Allele - The number of copies of the least frequent allele, e.g. if ABB, minor allele = A (1 copy) and major allele = B (2 copies).

[17:Q] Mut Type - Defined as Gain or Loss. For ICGC samples, as defined in the original data. For TCGA samples reanalysed with ASCAT -

    LOSS = average genome ploidy <= 2.7 AND total copy number = 0 OR average genome ploidy > 2.7 AND total copy number < ( average genome ploidy - 2.7 )
    GAIN = average genome ploidy <= 2.7 AND total copy number >= 5 OR average genome ploidy > 2.7 AND total copy number >= 9

[18:R] Id Study - Lists the unique Ids of studies that have involved this copy number variation.

[19:S] GRCh - The coordinate system used -

    37 = GRCh37/Hg19
    38 = GRCh38/Hg38

[20:T] Chromosome:G_Start..G_Stop - The genomic coordinates of the variation.
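The LOSS and GAIN definitions given for the Mut Type column (column 17) can be read as the rule sketched below in Python. The sketch assumes that AND binds more tightly than OR in those definitions, so each line describes two alternative cases. The function name and the None result for segments that are neither a gain nor a loss are assumptions made for this example.

```python
def classify_copy_number(total_cn, average_ploidy):
    """Return 'LOSS', 'GAIN' or None for an ASCAT-reanalysed TCGA segment."""
    if average_ploidy <= 2.7:
        if total_cn == 0:
            return "LOSS"
        if total_cn >= 5:
            return "GAIN"
    else:
        if total_cn < average_ploidy - 2.7:
            return "LOSS"
        if total_cn >= 9:
            return "GAIN"
    # Assumption: anything else is neither a gain nor a loss.
    return None
```

With an average ploidy of 2.0, a total copy number of 5 is a GAIN. With an average ploidy of 4.0, a total copy number of 1 is a LOSS, because 1 is less than 4.0 - 2.7.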

Gene Expression


All gene expression level 3 data from the TCGA portal for the current release in a tab separated table. Please note: the platform codes currently used to produce the COSMIC gene expression values are: IlluminaGA_RNASeqV2, IlluminaHiSeq_RNASeqV2, AgilentG4502A_07_2, AgilentG4502A_07_3. For more information on the gene expression data, please see http://cancer.sanger.ac.uk/cosmic/analyses.

CosmicCompleteGeneExpression.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample id,Sample name - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the ICGC and TCGA.

[3:C] Gene name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.

[4:D] Regulation - Over or under, depending on whether the scores from the different platforms are above or below the threshold.

[5:E] Z-score - An indicative score taken from the gene expression data. Where data are available from several platforms, they are used in this order of preference: IlluminaHiSeq_RNASeqV2, IlluminaGA_RNASeqV2, AgilentG4502A_07_3.

[6:F] Id Study - Lists the unique Ids of studies that have involved this gene expression data.
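The order of preference described for the Z-score column can be illustrated with a short Python sketch. The input here is a hypothetical dictionary mapping platform codes to z-scores for one sample and gene. This page does not describe how the per-platform values are held internally, so treat the structure and the function name as assumptions.

```python
# Platform codes in the order of preference given for the Z-score column.
PLATFORM_PREFERENCE = (
    "IlluminaHiSeq_RNASeqV2",
    "IlluminaGA_RNASeqV2",
    "AgilentG4502A_07_3",
)


def preferred_z_score(scores_by_platform):
    """Return (platform, z_score) from the most preferred platform present."""
    for platform in PLATFORM_PREFERENCE:
        score = scores_by_platform.get(platform)
        if score is not None:
            return platform, score
    # None of the listed platforms has a value.
    return None
```

If both an IlluminaGA_RNASeqV2 and an AgilentG4502A_07_3 value are present, the IlluminaGA_RNASeqV2 value is returned.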

Methylation


TCGA Level 3 methylation data from the ICGC portal for the current release in a tab separated table. More information on the methylation data is available from http://cancer.sanger.ac.uk/cosmic/analyses.

CosmicCompleteDifferentialMethylation.tsv.gz

File Description

[column number:label] Heading

[1:A] Study_ID - The study Id for these data.

[2:B] Id Sample,Sample name,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the TCGA.

[5:E] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.

[6:F] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[7:G] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[8:H] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[9:I] Primary Histology - The histological classification of the sample.

[10:J] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[11:K] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[12:L] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[13:M] Fragment Id - The unique probe Id for a specific CpG.

[14:N] Genome Version - The coordinate system used -

    37 = GRCh37/Hg19
    38 = GRCh38/Hg38

[15:O] Chromosome - The chromosome location of the probe (1-22, X or Y).

[16:P] Position - The genome location of the CpG targeted by the probe (1-based coordinates).

[17:Q] Strand - Positive or negative.

[18:R] Gene Name - The gene name (if the probe falls within the coding region of a COSMIC gene) or the probe annotation as described by Illumina.

[19:S] Methylation - The methylation level; H (High, beta-value >0.8) or L (Low, beta-value <0.2).

[20:T] Avg Beta Value Normal - The average beta-value across the normal population. The beta-value of the tumour must differ from this value by >0.5 to be considered a variant.

[21:U] Beta Value - The beta-value for the probe in the tumour sample. Only values >0.8 (High) or <0.2 (Low) are included.

[22:V] Two Sided P-Value - The two sided p-value.
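The beta-value thresholds described for the Methylation, Avg Beta Value Normal and Beta Value columns can be sketched in Python as follows. The function names are hypothetical. Combining the High/Low check with the >0.5 difference from the normal average into one test is an assumption about how the two criteria fit together.

```python
def methylation_level(beta_value):
    """Return 'H' (beta-value >0.8), 'L' (beta-value <0.2) or None."""
    if beta_value > 0.8:
        return "H"
    if beta_value < 0.2:
        return "L"
    return None


def is_methylation_variant(beta_value, avg_beta_normal):
    """Check both criteria: a High/Low level and a >0.5 shift from normal."""
    has_level = methylation_level(beta_value) is not None
    differs = abs(beta_value - avg_beta_normal) > 0.5
    return has_level and differs
```

A tumour beta-value of 0.9 against a normal average of 0.3 passes both checks, whereas 0.9 against a normal average of 0.6 does not.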

Cancer Gene Census


A list of all cancer census genes from the current release in a comma separated table. The census table is exported from http://cancer.sanger.ac.uk/census and the format is the same.

cancer_gene_census.csv

COSMIC Sample Features


All the features related to a sample from the current release in a tab separated file.

CosmicSample.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample id,Sample name,Id tumour,Id Individual - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the ICGC and TCGA.

[5:E] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.

[6:F] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[7:G] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[8:H] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[9:I] Primary Histology - The histological classification of the sample.

[10:J] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[11:K] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[12:L] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[13:M] Therapy Relationship - Relates the time-point of tissue sampling to the drug therapy used to treat the tumour.

[14:N] Sample Differentiator - Gives additional information if more than one sample (e.g. carcinomatous and sarcomatous components) from a tumour has been screened for mutations or if samples from a tumour were taken at different time points.

[15:O] Mutation Allele Specification - Where a publication has information on more than one mutation for one gene in a sample and reports whether or not the mutations occurred on the same or different chromosomes.

[16:P] Msi - If microsatellite instability data is given in the publication per sample then High, Low, Stable/Low, MSI or Stable is reported in COSMIC. Unknown is the default.

[17:Q] Average Ploidy - The average ploidy of the sample, calculated from copy number data (where available).

[18:R] Whole Genome Screen - 'y' if the sample was whole genome screened.

[19:S] Whole Exome Screen - 'y' if the sample was whole exome sequenced.

[20:T] Sample Remark - Any additional sample information e.g. % mutant allele burden.

[21:U] Drug Response - Clinical and in vitro responses to drugs (particularly targeted drugs). Phrasing based on RECIST guidelines. Note that in COSMIC, SD (stable disease) and PD (progressive disease) = clinical primary non response.

[22:V] Grade - Grade of tumour. The phrase 'Some Grade data are given in publication' is used when the publication reports grade data or when data hasn't been given per sample. More detailed data follow commonly used grading systems in tumours.

[23:W] Age at tumour recurrence - Where both primary and recurrent tumour samples from an individual have been screened for mutations and the age (in years) of the patient at the time of the recurrence is different to that at diagnosis.

[24:X] Stage - Stage of tumour. The phrase 'Some Stage data are given in publication' is used when the publication reports stage data or when data hasn't been given per sample. More detailed data follow commonly used staging systems in tumours.

[25:Y] Cytogenetics - Karyotype of the tumour.

[26:Z] Metastatic Site - Tissue site of any metastases identified in an individual.

[27:AA] Tumour Source - Source of tumour tissue sample e.g. primary, metastasis.

[28:AB] Tumour Remark - Any additional tumour information e.g. metachronous tumour.

[29:AC] Age - Age (in years) of individual at diagnosis.

[30:AD] Ethnicity - Ethnicity (e.g. Caucasian) of individual.

[31:AE] Environmental Variables - Environmental variables to which an individual has been exposed (e.g. viral exposure, smoking status).

[32:AF] Germline Mutation - Gene name/mutation if a germline mutation as well as a somatic mutation has been detected in the same gene in the same tumour sample.

[33:AG] Therapy - Any significant treatment an individual has received prior to mutation screening.

[34:AH] Family - Any familial cancer history for an individual or familial relationships of individuals screened for mutations in the same publication.

[35:AI] Normal tissue tested - If normal tissue from the same individual has been screened for mutations.

[36:AJ] Gender - Sex of individual.

[37:AK] Individual Remark - Any additional individual information (e.g. age group, hereditary syndromes).

[38:AL] NCI code - NCI thesaurus code for tumour histological classification.

[39:AM] Sample Type - Describes where the sample originated from.

COSMIC HGNC


A tab separated table showing the relationship between the Cancer Gene Census, COSMIC ID, Gene Name, HGNC ID and Entrez ID.

CosmicHGNC.tsv.gz

File Description

[column number:label] Heading

[1:A] COSMIC_ID - COSMIC Gene ID (COSG*).

[2:B] COSMIC_GENE_NAME - Gene name used in COSMIC.

[3:C] Entrez_id - Entrez ID mapping.

[4:D] HGNC_ID - HGNC mapping.

[5:E] Mutated? - Does the gene have coding mutations y/n.

[6:F] Cancer_census? - Is the gene in the Cancer gene census y/n.

[7:G] Expert Curated? - Has the gene been manually curated by the team of expert curators y/n.

COSMIC Resistance Mutations


A tab separated table listing the details of all mutations in COSMIC which are known to confer drug resistance.

CosmicResistanceMutations.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample name,Sample id - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[3:C] Gene Name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.

[4:D] Transcript - The transcript identifier (accession number) of the gene.

[5:E] Census Gene - Is the gene in the Cancer Gene Census (Yes or No).

[6:F] Drug Name - The name of the drug which the mutation confers resistance to.

[7:G] ID Mutation - The unique mutation identifier (COSM).

[8:H] AA Mutation - The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society.

[9:I] CDS Mutation - The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.

[10:J] Primary Tissue - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.

[11:K] Tissue Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[12:L] Tissue Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[13:M] Histology - The histological classification of the sample.

[14:N] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[15:O] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[16:P] Pubmed ID - The PUBMED ID for the paper that the sample was noted in, linking to pubmed to provide more details of the publication.

[17:Q] CGP Study - Lists the unique Ids of studies that have involved this sample.

[18:R] Somatic Status - Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin -

    Previously observed = when the mutation has been reported as somatic previously but not in current paper.
    Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.
    Variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.

[19:S] Sample Type - Describes where the sample has originated from including the tumour type.

[20:T] Zygosity - Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.

[21:U] Genome Coordinates (GRCh37/38) - The genome location of the mutation (chr:start..end), on the specified genome version.

[22:V] Tier - 1 or 2 [see here for details of Tier 1 and 2]
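The Genome Coordinates column (column 21) is described as chr:start..end. Assuming the values follow that layout literally, they could be split into parts with a small Python helper like the one below. The function name is hypothetical and the coordinate in the usage note is an illustrative value, not taken from the file.

```python
import re

# Matches the chr:start..end layout given in the column description.
_COORDINATES = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)\.\.(?P<end>\d+)$")


def parse_coordinates(text):
    """Split 'chr:start..end' into (chromosome, start, end)."""
    match = _COORDINATES.match(text.strip())
    if match is None:
        raise ValueError("not in chr:start..end form: %r" % text)
    return match.group("chrom"), int(match.group("start")), int(match.group("end"))
```

For example, parse_coordinates("7:100..200") gives the chromosome as a string and the start and end positions as integers. Values in any other layout raise a ValueError rather than being guessed at.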

ASCAT Ploidy and Purity Estimates


A tab separated table listing the ploidy and aberrant cell fraction (purity estimate), for TCGA samples re-analysed using ASCAT.

ascat_acf_ploidy.tsv

File Description

[column number:label] Heading

[1:A] Cancer_Type_Code - The disease code (decode available from https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm).

[2:B] Sample - The name of the sample.

[3:C] Aberrant_Cell_Fraction(Purity) - The aberrant cell fraction (purity estimate).

[4:D] Ploidy - The ploidy of the genome.

VCF Files (coding and non-coding mutations)


VCF file of all coding mutations in the current release.

VCF/CosmicCodingMuts.vcf.gz


VCF file of all non coding mutations in the current release.

VCF/CosmicNonCodingVariants.vcf.gz

Fasta File (genes)


COSMIC Transcripts


A tab separated table listing the gene name and transcript accession for each gene ID.

CosmicTranscripts.tsv.gz

File Description

[column number:label] Heading

[1:A] Gene ID - The unique ID of the gene.

[2:B] Gene_NAME - The name of the gene.

[3:C] Transcript ID - The accession of the transcript.

Oracle Database Dump


The Oracle database dump of the current release. Please see the help document OracleSchemaDocumentation.pdf for a description of the database schema.

COSMIC_ORACLE_EXPORT.dmp.gz.tar