Data Downloads (release v87, 13th November 2018)

This page allows you to download the various COSMIC data files. It also has descriptions of the data contained in each file.

You will need to log in to download the files. As part of COSMIC's growth and development plan, we have implemented a licensing strategy. Everyone is required to register in order to download data files, but only non-academic organisations need to pay a license fee. More information can be found on our licensing page.

Whole File Downloads

To download a complete file, simply click on the dark blue 'Download Whole File' button for the file that you require and your download will begin.

Filtered File Downloads

Some files can be filtered by any combination of gene, sample or cancer type:

  • click on the blue 'Download Filtered File' button to show the filter fields
  • fill in the filters that you require
  • as you type, look in the drop-down list for the gene, sample or cancer type that you need
  • the field will turn green if the filter matches something in the COSMIC database or red otherwise
  • click 'Download' to retrieve the filtered data

Scripted Downloads

You can download files programmatically. Click the purple 'Scripted download' button next to each file for information on how to retrieve that file via the command line or a script. All files for the current and past 6 versions of COSMIC are available for download. Check out our help pages for more information on downloading, and for an explanation of how to find a manifest for all available files.

Download a sample of COSMIC data

We have made the first 100 lines of each of the download files freely available so you can try out the data. More information can be found on our about page.

Complete mutation data


A tab separated table of all the point mutations in the COSMIC Cell Lines Project from the current release.

CosmicCLP_MutantExport.tsv.gz

File Description

[column number:label] Heading

[1:A] Gene name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.

[2:B] Accession Number - The transcript identifier of the gene.

[3:C] Gene CDS length - Length of the gene's coding sequence (CDS), in base pairs.

[4:D] HGNC id - If the gene is in HGNC, this ID links it to the HGNC record.

[5:E] Sample name,Sample id,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[8:H] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.

[9:I] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[10:J] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[11:K] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[12:L] Primary Histology - The histological classification of the sample.

[13:M] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[14:N] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[15:O] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[16:P] Genome-wide screen - If the entire genome/exome has been sequenced.

[17:Q] Mutation Id - Unique mutation identifier.

[18:R] Mutation CDS - The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.

[19:S] Mutation AA - The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society. The description of each type can be found by following the link to Mutation Overview page.

[20:T] Mutation Description - Type of mutation at the amino acid level (substitution, deletion, insertion, complex, fusion etc.)

[21:U] Mutation zygosity - Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.

[22:V] LOH - Information on whether the gene was reported to have loss of heterozygosity in the sample: yes, no or unknown.

[23:W] GRCh - The coordinate system used -

    37 = GRCh37/Hg19
    38 = GRCh38/Hg38

[24:X] Mutation genome position - The genomic coordinates of the mutation.

[25:Y] Mutation strand - Positive or negative.

[26:Z] SNP - Known SNPs are flagged as 'y'. SNPs are defined by the 1000 Genomes Project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing.

[27:AA] FATHMM prediction - More information about FATHMM (Functional Analysis through Hidden Markov Models) is available from here. FATHMM descriptors -

    Neutral = Defined as Passenger or Tolerated.
    Pathogenic = Defined as Cancer or Damaging.

[28:AB] FATHMM Score - The scores are p-values ranging from 0 to 1. Scores greater than 0.5 are classed as pathogenic and scores less than 0.5 as benign. Scores close to 0 or 1 are the high-confidence predictions. The results are annotated as 10 feature groups (separately for coding and non-coding variants), details of which can be found in the original FATHMM-MKL paper.

[29:AC] Mutation somatic status - Information on whether the mutation was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin -

    Confirmed Somatic = confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.
    Variant of unknown origin = known to be somatic but the tumour was sequenced without a matched normal.
    Previously observed = mutation reported as somatic previously but not in the current paper.

[30:AD] Mutation Verification Status - Information on whether the mutation has been validated -

    Verified = reported in other datasets including by the capillary sequencing of the sample.
    Unverified = has not been reported in other datasets.

[31:AE] Pubmed_PMID - The PubMed ID of the paper in which the sample was reported, linking to PubMed for more details of the publication.

[32:AF] Study ID - The Study ID for the sample.

[33:AG] Institute,Institute Address,Catalogue Number - Availability details (cell line supplier).

[36:AJ] Sample Type,Tumour origin - Where the sample originated from, including the tumour type.

[38:AL] Age - Age of the sample (if this information is provided with the publications).
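The column numbers above can be used to pull fields out of the export without relying on the exact header spelling. The following is a minimal sketch, not an official COSMIC tool: it assumes the file has a single header row, and it applies the 0.5 cut-off described for column 28. The function names are illustrative only.

```python
import csv
import gzip

# 1-based column numbers taken from the file description above.
COL_GENE = 1
COL_PRIMARY_SITE = 8
COL_MUTATION_AA = 19
COL_FATHMM_SCORE = 28


def open_export(path):
    """Open a gzipped, tab separated export as text."""
    return gzip.open(path, "rt", newline="")


def fathmm_label(score_text):
    """Label a FATHMM score using the 0.5 cut-off described for column 28."""
    if score_text.strip() == "":
        return None  # no score recorded for this row
    score = float(score_text)
    if score > 0.5:
        return "pathogenic"
    if score < 0.5:
        return "benign"
    return None  # exactly 0.5 is not covered by the description


def iter_gene_rows(handle, gene):
    """Yield (primary site, mutation AA, FATHMM label) for rows matching gene."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the header row (assumed to be present)
    for row in reader:
        if row[COL_GENE - 1] == gene:
            yield (
                row[COL_PRIMARY_SITE - 1],
                row[COL_MUTATION_AA - 1],
                fathmm_label(row[COL_FATHMM_SCORE - 1]),
            )
```

For example, after downloading the file, something like `with open_export("CosmicCLP_MutantExport.tsv.gz") as handle:` followed by a loop over `iter_gene_rows(handle, "TP53")` would list the rows for one gene. Filtering by position rather than by header name is a deliberate choice here, since the description above documents column numbers.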

Copy Number Data


All copy number segments identified by PICNIC analysis of the Affymetrix SNP6.0 array data, in a comma separated file. The file includes every segment, including those not defined as gain or loss, for every cell line. The data are also available as separate files, one for each cell line, identified by ID.

copy_number/cell_lines_copy_number.csv

File Description

[column number:label] Heading

[1:A] sample_name,sample_id - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[3:C] SNPstart - The ID of the first SNP in the segment.

[4:D] SNPend - The ID of the last SNP in the segment.

[5:E] chr - The chromosome identifier for the genome location, GRCh38.

[6:F] startpos - The genomic location of the start of the segment, GRCh38.

[7:G] endpos - The genomic location of the end of the segment, GRCh38.

[8:H] chr_37 - The chromosome identifier for the genome location, GRCh37.

[9:I] start_37 - The genomic location of the start of the segment, GRCh37.

[10:J] end_37 - The genomic location of the end of the segment, GRCh37.

[11:K] minorCN - The number of copies of the minor allele.

[12:L] totalCN - The total number of alleles, i.e. minor allele copy number + major allele copy number.
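Because totalCN is the sum of the minor and major allele copy numbers, the major allele copy number can be recovered as totalCN - minorCN. The sketch below illustrates this, assuming the CSV header row uses the column headings exactly as listed above; the function name is hypothetical.

```python
import csv


def add_major_copy_number(handle):
    """Return one dict per segment with the major allele copy number added."""
    segments = []
    for row in csv.DictReader(handle):
        total_cn = int(row["totalCN"])
        minor_cn = int(row["minorCN"])
        segments.append({
            "sample_name": row["sample_name"],
            "chr": row["chr"],
            "startpos": int(row["startpos"]),
            "endpos": int(row["endpos"]),
            "minorCN": minor_cn,
            # totalCN = minorCN + majorCN, so majorCN is the difference.
            "majorCN": total_cn - minor_cn,
            "totalCN": total_cn,
        })
    return segments
```

The function accepts any open text handle, so it works equally on the combined file and on a per-cell-line file.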


This tab separated file lists the copy number variants for each cell line identified by PICNIC analysis of the Affymetrix SNP6.0 array data. Please note that by default the COSMIC website only displays variants where the minor allele and total copy number are known. However, there is an option to view all variants. For more information on copy number data, please see http://cancer.sanger.ac.uk/cell_lines/analyses.

CosmicCLP_CompleteCNA.tsv.gz

File Description

[column number:label] Heading

[1:A] Id CNV - The primary key of the table holding the data (not stable, differs between releases).

[2:B] Id gene,Gene name - The ID and symbol of the gene which overlaps the copy number segment (or '-' where there is no overlapping gene).

[4:D] Sample id,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the ICGC and TCGA.

[6:F] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.

[7:G] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[8:H] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[9:I] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[10:J] Primary Histology - The histological classification of the sample.

[11:K] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[12:L] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[13:M] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[14:N] Sample Name - The name of the sample.

[15:O] Total_CN - The sum of the major and minor allele counts, e.g. if ABB, total copy number = 3.

[16:P] Minor Allele - The number of copies of the least frequent allele, e.g. if ABB, minor allele = A (1 copy) and major allele = B (2 copies).

[17:Q] Mut Type - Cell lines array data was analysed with PICNIC (http://www.sanger.ac.uk/science/tools/picnic) and gain/loss defined as follows -

    LOSS = (average genome ploidy <= 2.7 AND total copy number = 0) OR (average genome ploidy > 2.7 AND total copy number < (average genome ploidy - 2.7))
    GAIN = (average genome ploidy <= 2.7 AND total copy number >= 5) OR (average genome ploidy > 2.7 AND total copy number >= 9)

[18:R] Id Study - Lists the unique Ids of studies that have involved this copy number variation.

[19:S] GRCh - The coordinate system used -

    38 = GRCh38/Hg38
    37 = GRCh37/Hg19

[20:T] Chromosome:G_Start..G_Stop - The genomic coordinates of the variation.
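The gain/loss rule given for column 17 and the coordinate layout named in the column 20 heading can be expressed in a few lines. This is a sketch under stated assumptions rather than the PICNIC implementation: it reads AND as binding more tightly than OR in the rule, takes the average ploidy as a separate input (for example from PICNIC_average_ploidies.tsv), and assumes coordinates look like `chromosome:start..stop`. The function names are hypothetical.

```python
def classify_copy_number(total_cn, average_ploidy):
    """Return "LOSS", "GAIN" or None following the rule for column 17."""
    if average_ploidy <= 2.7:
        if total_cn == 0:
            return "LOSS"
        if total_cn >= 5:
            return "GAIN"
    else:
        if total_cn < average_ploidy - 2.7:
            return "LOSS"
        if total_cn >= 9:
            return "GAIN"
    return None  # segment is neither a gain nor a loss


def parse_coordinates(text):
    """Split 'chromosome:start..stop' into (chromosome, start, stop)."""
    chromosome, _, span = text.partition(":")
    start, _, stop = span.partition("..")
    return chromosome, int(start), int(stop)
```

Note that for a cell line with average ploidy 4.0, a total copy number of 1 counts as a loss (1 < 1.3) while a total copy number of 2 does not.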

PICNIC Average Ploidies


A tab separated file listing the average ploidy of each cell line calculated using the PICNIC algorithm.

PICNIC_average_ploidies.tsv

File Description

[column number:label] Heading

[1:A] Sample Name - The name of the sample (cell line)

[2:B] Sample ID - The unique ID of the sample

[3:C] Average Ploidy - The average ploidy of the sample (cell line)

Gene Expression


The platform used was the Affymetrix Human Genome U219 Array. All gene expression data from the most current release of COSMIC Cell Lines Project in a tab separated file.

CosmicCLP_CompleteGeneExpression.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample id,Sample name - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the ICGC and TCGA.

[3:C] Gene Name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.

[4:D] Regulation - Defined as Over or Under expressed. More details from here.

[5:E] Z-score - Serves as an indicative score of expression level.

Non coding variants


A tab separated table of all non-coding mutations from the current release.

CosmicCLP_NCVExport.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample name,Sample id,Tumour id - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[4:D] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.

[5:E] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[6:F] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[7:G] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[8:H] Primary Histology - The histological classification of the sample.

[9:I] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[10:J] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[11:K] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[12:L] Whole Genome screen - If the entire genome/exome has been sequenced.

[13:M] Mutation Id - Unique non-coding variant identifier.

[14:N] Zygosity - Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.

[15:O] Genome Version - The coordinate system used -

    37 = GRCh37/Hg19
    38 = GRCh38/Hg38

[16:P] Genome coordinates - The genomic coordinate of the mutation.

[17:Q] Mutation somatic status - Information on whether the mutation was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin -

    Variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.
    Previously observed = when the mutation has been reported as somatic previously but not in the current paper.
    Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.

[18:R] WT SEQ - Wild type sequence.

[19:S] MUT SEQ - Mutated sequence.

[20:T] SNP - All the known SNPs are flagged as 'y' defined by the 1000 genomes project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing.

[21:U] FATHMM_MKL_NON_CODING_SCORE - FATHMM-MKL non-coding score. A p-value ranging from 0 to 1 where >= 0.7 is functionally significant.

[22:V] FATHMM_MKL_NON_CODING_GROUPS - FATHMM-MKL group classification. More details from here.

[23:W] FATHMM_MKL_CODING_SCORE - FATHMM-MKL coding score (p-value ranging from 0 to 1).

[24:X] FATHMM_MKL_CODING_GROUPS - FATHMM-MKL group classification (coding). More details from here.

[25:Y] Whole Genome Reseq - If the entire genome has been sequenced.

[26:Z] Whole_Exome - If the entire exome has been sequenced.

[27:AA] Id Study - Lists the unique Ids of studies that have involved this non coding mutation.

Raw Gene Expression


The platform used was the Affymetrix Human Genome U219 Array. This file contains all the raw gene expression data from the most current release of COSMIC Cell Lines Project in a tab separated file.

CosmicCLP_RawGeneExpression.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample id,Sample name - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the ICGC and TCGA.

[3:C] Gene Name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.

[4:D] Gene Expression - Expression level for the gene from the Affymetrix Human Genome U219 Array data.

VCF files (coding and non-coding mutations)


VCF file of all coding mutations in the cell lines project.

VCF/CellLinesCodingMuts.vcf.gz


VCF file of all non coding mutations in the cell lines project.

VCF/CellLinesNonCodingVariants.vcf.gz


Lists all substitution variants in the specific line identified by Caveman after filtering to remove common SNPs. Files will be contaminated with germline and false positive calls. These files are available for the GRCh37 archive only.

VCF/caveman.tgz


Lists all small insertion/deletion variants in the specific line identified by Pindel after filtering to remove common SNPs. Files will be contaminated with germline and false positive calls. These files are available for the GRCh37 archive only.

VCF/pindel.tgz
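The VCF files can be read line by line once decompressed. The sketch below assumes the files follow the standard VCF layout of eight fixed, tab separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) with header lines starting with '#'. The INFO keys present in the COSMIC files are not described on this page, so the parser simply returns whatever key=value pairs and flags it finds.

```python
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf_line(line):
    """Parse one VCF data line into a dict keyed by the fixed column names."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            # Entries without '=' are flags and are stored as True.
            info[key] = value if sep else True
    record["INFO"] = info
    return record


def iter_vcf_records(handle):
    """Yield parsed records, skipping header ('#') and blank lines."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)
```

A gzipped file can be passed in as a text handle, for example one opened with `gzip.open(path, "rt")`.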

QC


This file lists the SNP fingerprint (based on 97 SNPs using the Sequenom system), STR fingerprint (including repository information for matched samples where available) and MSI status of all 1025 cancer cell lines.

QC.xlsx

Sequence Coverage Statistics


The file lists the exome sequencing statistics for all cell lines.

seq_stats.xls

File Description

[column number:label] Heading

[1:A] READ 0 - % of bases not covered by any sequence

[2:B] READ 21 - % of bases covered by a minimum of 21 reads

[3:C] READ 41 - % of bases covered by a minimum of 41 reads

[4:D] Rpair - Total number of read pairs

[5:E] Gbp Seq - Total sequence

[6:F] UM Pairs - % of unmapped reads

[7:G] Gbp Map - Total of mapped sequence

[8:H] Mapped - Percentage of sequence mapped

[9:I] Gbp Uniq - Total of mapped unique reads

[10:J] Uniq - Percentage of mapped reads that are unique

Genotypes


Files listing the SNP calls for each cell line identified by PICNIC analysis of Affymetrix SNP6.0 array data. Both a simple genotype (AA, BB - homozygous or AB - heterozygous) and a complex interpretation of the genotype are given (for example, in a triploid region of the genome the genotype may be AAB).

genotypes.tar

File Description

[column number:label] Heading

[1:A] Chr - Chromosome GRCh38/hg38

[2:B] pos - Genome Position GRCh38/hg38

[3:C] ncopies.A - Number of copies of allele A

[4:D] ncopies.B - Number of copies of allele B

[5:E] Probe.Set.ID - SNP6.0 probe ID

[6:F] dbSNP.RS.ID - dbSNP reference ID

[7:G] Allele.A - Genotype 'A' nucleotide

[8:H] Allele.B - Genotype 'B' nucleotide

[9:I] chr_b36 - Chromosome NCBI36/hg18

[10:J] pos_b36 - Genome Position NCBI36/hg18

[11:K] chr_b37 - Chromosome GRCh37/hg19

[12:L] pos_b37 - Genome Position GRCh37/hg19

[13:M] complexGenotype - A complex interpretation of the genotype, e.g. in a triploid region the genotype may be AAB.

[14:N] simpleGenotype - A simple genotype, e.g. AA, BB - homozygous or AB - heterozygous.
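The relationship between the allele copy counts (columns 3 and 4) and the two genotype columns can be illustrated as follows. This is only an interpretation of the descriptions above, not the PICNIC algorithm: it assumes the complex genotype spells out one letter per allele copy and that the simple genotype only records which alleles are present. The files already contain both genotype columns, so this is mainly useful as a consistency check.

```python
def complex_genotype(ncopies_a, ncopies_b):
    """Spell out one letter per allele copy, e.g. 2 x A and 1 x B -> 'AAB'."""
    return "A" * ncopies_a + "B" * ncopies_b


def simple_genotype(ncopies_a, ncopies_b):
    """Reduce copy counts to AA, BB (homozygous) or AB (heterozygous)."""
    if ncopies_a > 0 and ncopies_b > 0:
        return "AB"
    if ncopies_a > 0:
        return "AA"
    if ncopies_b > 0:
        return "BB"
    return None  # no copies of either allele
```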

Fasta File (genes)


CDS sequence for all the genes in Cell Line Project.

All_CellLines_Genes.fasta.gz

Oracle Database Dump


The Oracle database dump of the current COSMIC Cell Lines Project release. Please see the OracleSchemaDocumentation.pdf for a description of the database schema.

CLP_ORACLE_EXPORT.dmp.gz.tar