Data Downloads (release v98, 23rd May 2023)

We are supporting the legacy downloads for a year (i.e. for v98 and v99). Thereafter these downloads will be phased out and all downloads will be available in the new format on this page: New Downloads

Commercial users: please access COSMIC data downloads from the Qiagen website.

This page allows you to download the various COSMIC data files. It also has descriptions of the data contained in each file.

You will need to log in to download the files. As part of COSMIC's growth and development plan, we have implemented a licensing strategy. Everyone is required to register in order to download data. More information can be found on our licensing page.

Whole File Downloads

To download a complete file, simply click on the dark blue 'Download Whole File' button for the file that you require and your download will begin.

Filtered File Downloads

Some files can be filtered by any combination of gene, sample or cancer type:

  • click on the blue 'Download Filtered File' button to show the filter fields
  • fill in the filters that you require
  • as you type, look in the drop-down list for the gene, sample or cancer type that you need
  • the field will turn green if the filter matches something in the COSMIC database or red otherwise
  • click 'Download' to retrieve the filtered data

Scripted Downloads

You can download files programmatically. Click the purple 'Scripted download' button next to each file for information on how to retrieve that file via the command line or a script. All files for the current and past 3 versions of COSMIC are available for download. Check out our help pages for more information on downloading, and for an explanation of how to find a manifest for all available files.

Download a sample of COSMIC data

We have made the first 100 lines of each of the download files freely available so you can try out the data. More information can be found on our about page.

Complete mutation data


A tab separated table of all the point mutations in the COSMIC Cell Lines Project from the current release.

CosmicCLP_MutantExport.tsv.gz

File Description

[column number:label] Heading

[1:A] Gene name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.

[2:B] Accession Number - The transcript identifier of the gene.

[3:C] Gene CDS length - Length of the gene's coding sequence (CDS), in base pairs.

[4:D] HGNC id - If the gene is in HGNC, this id links it to its HGNC entry.

[5:E] Sample name,Sample id,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[8:H] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.

[9:I] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[10:J] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[11:K] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[12:L] Primary Histology - The histological classification of the sample.

[13:M] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[14:N] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[15:O] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[16:P] Genome-wide screen - Whether the entire genome/exome has been sequenced.

[17:Q] GENOMIC_MUTATION_ID - Genomic mutation identifier (COSV) to indicate the definitive position of the variant on the genome. This identifier is trackable and stable between different versions of the release.

[18:R] LEGACY_MUTATION_ID - Legacy mutation identifier (COSM) that will represent existing COSM mutation identifiers.

[19:S] MUTATION_ID - An internal mutation identifier to uniquely represent each mutation on a specific transcript on a given assembly build.

[20:T] Mutation CDS - The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.

[21:U] Mutation AA - The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society. The description of each type can be found by following the link to Mutation Overview page.

[22:V] Mutation Description - Type of mutation at the amino acid level (substitution, deletion, insertion, complex, fusion etc.)

[23:W] Mutation zygosity - Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.

[24:X] LOH - Information on whether the gene was reported to have loss of heterozygosity in the sample: yes, no or unknown.

[25:Y] GRCh - The coordinate system used -

    38 = GRCh38/Hg38
    37 = GRCh37/Hg19

[26:Z] Mutation genome position - The genomic coordinates of the mutation.

[27:AA] Mutation strand - Positive or negative.

[28:AB] Mutation somatic status - Information on whether the mutation was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin -

    Previously observed = mutation reported as somatic previously but not in the current paper.
    Variant of unknown origin = known to be somatic but the tumour was sequenced without a matched normal.
    Confirmed Somatic = confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.

[29:AC] Mutation Verification Status - Information on whether the mutation has been validated -

    Verified = reported in other datasets including by the capillary sequencing of the sample.
    Unverified = has not been reported in other datasets.

[30:AD] Pubmed_PMID - The PubMed ID for the paper that the sample was noted in, linking to PubMed to provide more details of the publication.

[31:AE] Study ID - The Study ID for the sample.

[32:AF] Institute,Institute Address,Catalogue Number - Availability details (cell line supplier).

[35:AI] Sample Type,Tumour origin - Where the sample originated from, including the tumour type.

[37:AK] Age - Age of the sample (if this information is provided with the publications).

[38:AL] HGVSP - Human Genome Variation Society peptide syntax.

[39:AM] HGVSC - Human Genome Variation Society coding DNA sequence syntax (CDS).

[40:AN] HGVSG - Human Genome Variation Society genomic syntax (3' shifted).
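
The column numbers and labels above can be used directly when reading the file. The following is a minimal Python sketch, assuming the download is a gzip-compressed, tab separated text file. The helper names (label_to_number, select_columns, read_export) are illustrative only and are not part of any COSMIC tooling.

```python
import csv
import gzip


def label_to_number(label):
    """Convert a spreadsheet column label (A, Z, AA, AN) to its 1-based number."""
    number = 0
    for char in label.upper():
        if not "A" <= char <= "Z":
            raise ValueError(f"invalid column label: {label!r}")
        number = number * 26 + (ord(char) - ord("A") + 1)
    return number


def select_columns(lines, column_numbers):
    """Yield the chosen 1-based columns from each tab separated row."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield [row[n - 1] for n in column_numbers]


def read_export(path, column_numbers):
    """Stream selected columns from a gzipped, tab separated export file."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from select_columns(handle, column_numbers)
```

For example, read_export("CosmicCLP_MutantExport.tsv.gz", [1, 17, 21]) would yield the Gene name, GENOMIC_MUTATION_ID and Mutation AA columns for every row, starting with whatever the first line of the file contains.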

Copy Number Data


All copy number segments identified by PICNIC analysis of the Affymetrix SNP6.0 array data, in a comma separated file. The data include every segment, including those not defined as gain or loss, and are available either as a single file covering every cell line or as separate files for each cell line, identified by ID.

copy_number/cell_lines_copy_number.csv

File Description

[column number:label] Heading

[1:A] sample_name,sample_id - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[3:C] SNPstart - The ID of the first SNP in the segment.

[4:D] SNPend - The ID of the last SNP in the segment.

[5:E] chr - The chromosome identifier for the genome location, GRCh38.

[6:F] startpos - The genomic location of the start of the segment, GRCh38.

[7:G] endpos - The genomic location of the end of the segment, GRCh38.

[8:H] chr_37 - The chromosome identifier for the genome location, GRCh37.

[9:I] start_37 - The genomic location of the start of the segment, GRCh37.

[10:J] end_37 - The genomic location of the end of the segment, GRCh37.

[11:K] minorCN - The number of copies of the minor allele.

[12:L] totalCN - The total number of alleles, i.e. minor allele copy number + major allele copy number.
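
Because totalCN is the sum of the minor and major allele copy numbers, the major allele copy number can be recovered by subtraction. The following is a minimal Python sketch, assuming the CSV has a header row that uses the column names listed above. The function name add_major_cn and the added key majorCN are hypothetical.

```python
import csv


def add_major_cn(lines):
    """Yield each CSV row as a dict with an extra integer 'majorCN' entry.

    Rows whose minorCN or totalCN cannot be read as an integer are skipped
    (assumption: unknown values, if present, are not written as integers).
    """
    for row in csv.DictReader(lines):
        try:
            minor = int(row["minorCN"])
            total = int(row["totalCN"])
        except (TypeError, ValueError):
            continue
        # totalCN = minorCN + majorCN, so majorCN = totalCN - minorCN.
        row["majorCN"] = total - minor
        yield row
```

The function accepts any iterable of text lines, so it can be given an open file handle for copy_number/cell_lines_copy_number.csv or a list of strings for testing.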


This tab separated file lists the copy number variants for each cell line identified by PICNIC analysis of the Affymetrix SNP6.0 array data. Please note that by default the COSMIC website only displays variants where the minor allele and total copy number are known. However, there is an option to view all variants. For more information on copy number data, please see http://cancer.sanger.ac.uk/cell_lines/analyses.

CosmicCLP_CompleteCNA.tsv.gz

File Description

[column number:label] Heading

[1:A] Id CNV - The primary key of the table holding the data (not stable, differs between releases).

[2:B] Id gene,Gene name - The ID and symbol of the gene which overlaps the copy number segment (or '-' where there is no overlapping gene).

[4:D] Sample id,Id tumour - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the ICGC and TCGA.

[6:F] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.

[7:G] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[8:H] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[9:I] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[10:J] Primary Histology - The histological classification of the sample.

[11:K] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[12:L] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[13:M] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[14:N] Sample Name - The name of the sample.

[15:O] Total_CN - The sum of the major and minor allele counts, e.g. if ABB, total copy number = 3.

[16:P] Minor Allele - The number of copies of the least frequent allele, e.g. if ABB, minor allele = A (1 copy) and major allele = B (2 copies).

[17:Q] Mut Type - Cell lines array data was analysed with PICNIC (http://www.sanger.ac.uk/science/tools/picnic) and gain/loss defined as follows -

    GAIN = (average genome ploidy <= 2.7 AND total copy number >= 5) OR (average genome ploidy > 2.7 AND total copy number >= 9)
    LOSS = (average genome ploidy <= 2.7 AND total copy number = 0) OR (average genome ploidy > 2.7 AND total copy number < (average genome ploidy - 2.7))

[18:R] Id Study - Lists the unique Ids of studies that have involved this copy number variation.

[19:S] GRCh - The coordinate system used -

    38 = GRCh38/Hg38
    37 = GRCh37/Hg19

[20:T] Chromosome:G_Start..G_Stop - The genomic coordinates of the variation.
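
The coordinate column can be split into its three parts before further analysis. The following is a minimal Python sketch, assuming each value literally follows the Chromosome:G_Start..G_Stop pattern named in the heading. The names Segment and parse_cna_coordinates are hypothetical, and the value in the usage note is made up.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    chromosome: str
    start: int
    stop: int


# Chromosome, a colon, the start position, two dots, the stop position.
_COORD_PATTERN = re.compile(r"^([^:]+):(\d+)\.\.(\d+)$")


def parse_cna_coordinates(text):
    """Parse a 'Chromosome:G_Start..G_Stop' string into a Segment."""
    match = _COORD_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised coordinate string: {text!r}")
    chromosome, start, stop = match.groups()
    return Segment(chromosome, int(start), int(stop))
```

For instance, parse_cna_coordinates("7:100..200") returns a Segment with chromosome "7", start 100 and stop 200. Strings in any other layout raise ValueError rather than being guessed at.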

PICNIC Average Ploidies


A tab separated file listing the average ploidy of each cell line calculated using the PICNIC algorithm.

PICNIC_average_ploidies.tsv

File Description

[column number:label] Heading

[1:A] Sample Name - The name of the sample (cell line)

[2:B] Sample ID - The unique ID of the sample

[3:C] Average Ploidy - The average ploidy of the sample (cell line)
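
Average genome ploidy is one of the inputs to the GAIN/LOSS rule given for the Mut Type column of CosmicCLP_CompleteCNA.tsv.gz. The following is a minimal Python sketch of that rule. It assumes that each AND in the rule binds more tightly than OR, and that the ploidy value in this file is the one the rule refers to. The function name classify_copy_number is hypothetical.

```python
PLOIDY_THRESHOLD = 2.7


def classify_copy_number(average_ploidy, total_cn):
    """Return 'GAIN', 'LOSS' or None using the thresholds quoted for Mut Type."""
    if average_ploidy <= PLOIDY_THRESHOLD:
        if total_cn >= 5:
            return "GAIN"
        if total_cn == 0:
            return "LOSS"
    else:
        if total_cn >= 9:
            return "GAIN"
        if total_cn < average_ploidy - PLOIDY_THRESHOLD:
            return "LOSS"
    # Segments matching neither rule are not defined as gain or loss.
    return None
```

Under these assumptions a segment with total copy number 5 is a GAIN in a cell line with average ploidy 2.0, but falls into neither class in a cell line with average ploidy 3.0, where the gain threshold is 9.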

Gene Expression


The platform used was the Affymetrix Human Genome U219 Array. All gene expression data from the current release of the COSMIC Cell Lines Project in a tab separated file.

CosmicCLP_CompleteGeneExpression.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample id,Sample name - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the ICGC and TCGA.

[3:C] Gene Name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.

[4:D] Regulation - Defined as Over or Under expressed. More details from here.

[5:E] Z-score - Serves as an indicative score of expression level.

Non coding variants


A tab separated table of all non-coding mutations from the current release.

CosmicCLP_NCVExport.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample name,Sample id,Tumour id - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.

[4:D] Primary Site - The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from here. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.

[5:E] Site Subtype 1 - Further sub classification (level 1) of the sample's tissue of origin.

[6:F] Site Subtype 2 - Further sub classification (level 2) of the sample's tissue of origin.

[7:G] Site Subtype 3 - Further sub classification (level 3) of the sample's tissue of origin.

[8:H] Primary Histology - The histological classification of the sample.

[9:I] Histology Subtype 1 - Further histological classification (level 1) of the sample.

[10:J] Histology Subtype 2 - Further histological classification (level 2) of the sample.

[11:K] Histology Subtype 3 - Further histological classification (level 3) of the sample.

[12:L] Whole Genome screen - Whether the entire genome/exome has been sequenced.

[13:M] GENOMIC_MUTATION_ID - Genomic mutation identifier (COSV) to indicate the definitive position of the variant on the genome. This identifier is trackable and stable between different versions of the release.

[14:N] LEGACY_MUTATION_ID - Legacy mutation identifier (COSN) that will represent existing COSN mutation identifiers.

[15:O] Zygosity - Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.

[16:P] Genome Version - The coordinate system used -

    38 = GRCh38/Hg38
    37 = GRCh37/Hg19

[17:Q] Genome coordinates - The genomic coordinate of the mutation.

[18:R] Mutation somatic status - Information on whether the mutation was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin -

    Confirmed Somatic = the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.
    Variant of unknown origin = the mutation is known to be somatic but the tumour was sequenced without a matched normal.
    Previously observed = the mutation has been reported as somatic previously but not in the current paper.

[19:S] WT SEQ - Wild type sequence.

[20:T] MUT SEQ - Mutated sequence.

[21:U] Whole Genome Reseq - Whether the entire genome has been sequenced.

[22:V] Whole_Exome - Whether the entire exome has been sequenced.

[23:W] Id Study - Lists the unique Ids of studies that have involved this non coding mutation.

[24:X] HGVSG - Human Genome Variation Society genomic syntax (3' shifted).

Raw Gene Expression


The platform used was the Affymetrix Human Genome U219 Array. This file contains all the raw gene expression data from the current release of the COSMIC Cell Lines Project in a tab separated file.

CosmicCLP_RawGeneExpression.tsv.gz

File Description

[column number:label] Heading

[1:A] Sample id,Sample name - A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the ICGC and TCGA.

[3:C] Gene Name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.

[4:D] Gene Expression - Expression level for the gene from the Affymetrix Human Genome U219 Array data.

CLP Mutation Tracking


A tab separated table listing the mapping of all of the COSMIC Cell Lines Project's legacy mutation identifiers (COSMs) to the new genomic identifiers (COSVs). This file also helps to identify the transcripts and accession numbers on which each mutation is annotated, along with the mutation type.

CosmicCLPMutationTracking.tsv.gz

File Description

[column number:label] Heading

[1:A] Is_canonical - Identifies whether the transcript is canonical: the value is yes if it is a canonical transcript and no otherwise.

[2:B] Mutation_type - Type of mutation (coding, non-coding etc.)

[3:C] GRCH - Genome version of the mutation.

[4:D] MUTATION_ID - An internal mutation identifier to uniquely represent each mutation on a specific transcript on a given assembly build.

[5:E] LEGACY_MUTATION_ID - Legacy mutation identifier (COSM) that will represent existing COSM mutation identifiers.

[6:F] GENOMIC_MUTATION_ID - Genomic mutation identifier (COSV) to indicate the definitive position of the variant on the genome. This identifier is trackable and stable between different versions of the release.

[7:G] Accession Number - The transcript identifier of the gene.

[8:H] Gene Name - The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.
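
A common use of this file is to look up the genomic identifier for a legacy identifier. The following is a minimal Python sketch that builds such a lookup from columns 5 and 6, assuming the file is tab separated with one header row. The function name legacy_to_genomic is hypothetical. Because the same legacy identifier can appear on several rows (one per transcript), each key maps to a set.

```python
import csv
from collections import defaultdict

LEGACY_COLUMN = 5   # [5:E] LEGACY_MUTATION_ID (COSM)
GENOMIC_COLUMN = 6  # [6:F] GENOMIC_MUTATION_ID (COSV)


def legacy_to_genomic(lines, has_header=True):
    """Map each legacy identifier to the set of genomic identifiers seen with it."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row
    mapping = defaultdict(set)
    for row in reader:
        legacy = row[LEGACY_COLUMN - 1]
        genomic = row[GENOMIC_COLUMN - 1]
        if legacy and genomic:
            mapping[legacy].add(genomic)
    return dict(mapping)
```

The function takes any iterable of text lines, for example a handle returned by gzip.open("CosmicCLPMutationTracking.tsv.gz", "rt").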

VCF files (coding and non-coding mutations)


VCF file of all coding mutations in the cell lines project.

VCF/CellLinesCodingMuts.vcf.gz


VCF file of all coding mutations (normalised) in the cell lines project. The file has the variants 5' shifted as per the VCF standard, and the INFO column contains the 3' shifted syntaxes for CDS and genome, along with the unshifted variants in the OLD_VARIANT field.

VCF/CellLinesCodingMuts.normal.vcf.gz


VCF file of all non-coding mutations in the cell lines project.

VCF/CellLinesNonCodingVariants.vcf.gz


VCF file of all non-coding variants (normalised) in the cell lines project. The file has the variants 5' shifted as per the VCF standard, and the INFO column contains the 3' shifted syntaxes for CDS and genome, along with the unshifted variants in the OLD_VARIANT field.

VCF/CellLinesNonCodingVariants.normal.vcf.gz
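
To recover the unshifted variant from a normalised file, the INFO column of each data line can be split into key/value pairs. The following is a minimal Python sketch, assuming OLD_VARIANT is a key inside the INFO column and that INFO follows the usual VCF layout of semicolon separated key=value items. The function names are hypothetical, and the format of the OLD_VARIANT value is not described here, so it is returned as an uninterpreted string.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; keys without a value map to True."""
    fields = {}
    if info in ("", "."):
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def old_variant(vcf_line):
    """Return the OLD_VARIANT value from one VCF data line, or None if absent."""
    columns = vcf_line.rstrip("\n").split("\t")
    # INFO is the eighth fixed column of a VCF data line.
    info = parse_info(columns[7])
    value = info.get("OLD_VARIANT")
    return value if isinstance(value, str) else None
```

Header lines (those starting with #) should be skipped before calling old_variant.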

QC


This file lists the SNP fingerprint (based on 97 SNPs using the Sequenom system), STR fingerprint (including repository information for matched samples where available) and MSI status of all 1025 cancer cell lines.

QC.xlsx

Sequence Coverage Statistics


The file lists the exome sequencing statistics for all cell lines.

seq_stats.xls

File Description

[column number:label] Heading

[1:A] READ 0 - % of bases not covered by any sequence

[2:B] READ 21 - % of bases covered by a minimum of 21 reads

[3:C] READ 41 - % of bases covered by a minimum of 41 reads

[4:D] Rpair - Total number of read pairs

[5:E] Gbp Seq - Total sequence

[6:F] UM Pairs - % of unmapped reads

[7:G] Gbp Map - Total of mapped sequence

[8:H] Mapped - Percentage of sequence mapped

[9:I] Gbp Uniq - Total of mapped unique reads

[10:J] Uniq - Percentage of mapped reads that are unique

Genotypes


Files listing the SNP calls for each cell line identified by PICNIC analysis of Affymetrix SNP6.0 array data. Both a simple genotype (AA, BB - homozygous or AB - heterozygous) and a complex interpretation of the genotype are given (for example, in a triploid region of the genome the genotype may be AAB).

genotypes.tar

File Description

[column number:label] Heading

[1:A] Chr - Chromosome GRCh38/hg38

[2:B] pos - Genome Position GRCh38/hg38

[3:C] ncopies.A - Number of copies of allele A

[4:D] ncopies.B - Number of copies of allele B

[5:E] Probe.Set.ID - SNP6.0 probe ID

[6:F] dbSNP.RS.ID - dbSNP reference ID

[7:G] Allele.A - Genotype 'A' nucleotide

[8:H] Allele.B - Genotype 'B' nucleotide

[9:I] chr_b36 - Chromosome NCBI36/hg18

[10:J] pos_b36 - Genome Position NCBI36/hg18

[11:K] chr_b37 - Chromosome GRCh37/hg19

[12:L] pos_b37 - Genome Position GRCh37/hg19

[13:M] complexGenotype - A complex interpretation of the genotype, e.g. in a triploid region the genotype may be AAB.

[14:N] simpleGenotype - A simple genotype, e.g. AA, BB - homozygous or AB - heterozygous.
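
The relationship between the two genotype columns can be illustrated by collapsing a complex genotype into a simple one. The following is a minimal Python sketch based only on the definitions above: a genotype containing both alleles is heterozygous (AB), and one containing a single allele is homozygous (AA or BB). Whether the simpleGenotype values in the files are derived in exactly this way is an assumption, and simplify_genotype is a hypothetical name.

```python
def simplify_genotype(complex_genotype):
    """Collapse a complex genotype such as 'AAB' to 'AA', 'BB' or 'AB'."""
    alleles = set(complex_genotype.upper())
    if not alleles or not alleles <= {"A", "B"}:
        raise ValueError(f"unexpected genotype: {complex_genotype!r}")
    if alleles == {"A", "B"}:
        return "AB"  # both alleles present: heterozygous
    return "AA" if alleles == {"A"} else "BB"  # one allele present: homozygous
```

For the triploid example above, simplify_genotype("AAB") gives "AB".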

Fasta File (genes)


CDS sequences for all the genes in the Cell Lines Project.

All_CellLines_Genes.fasta.gz
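
The following is a minimal Python sketch for reading the sequences, assuming the file is a gzip-compressed text file in standard FASTA layout (a header line starting with > followed by one or more sequence lines). The function names are hypothetical, and the content of the header lines is not described here, so headers are returned unparsed.

```python
import gzip


def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted text lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fasta_gz(path):
    """Stream (header, sequence) pairs from a gzipped FASTA file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        yield from parse_fasta(handle)
```

Records are yielded one at a time, so read_fasta_gz("All_CellLines_Genes.fasta.gz") can be iterated without loading the whole file into memory.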