Cancer Genome Annotation


Exclusion of hypermutated samples:

Cancer genomes can be a very noisy source of data. It is estimated that an individual's tumour is caused by 5-10 driver mutations, but genome resequencing regularly reveals over 10,000 somatic mutations per tumour, with much larger numbers not unusual in hypermutated samples (we've seen samples with over 100,000 mutations each, the greatest being 178,763).

Across studies from different groups using different techniques, it is unclear whether these huge numbers reflect true hypermutation, substantial germline variation or technical artefacts. To improve the value of these data, we exclude any sample with over 15,000 mutations, because including such samples would introduce a large amount of noise into COSMIC.
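The exclusion rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary layout and the sample identifiers are hypothetical, and only the 15,000 threshold comes from the text above.

```python
# Threshold stated in the text: samples with MORE than 15,000 mutations are excluded.
MAX_MUTATIONS_PER_SAMPLE = 15_000


def exclude_hypermutated(mutation_counts, threshold=MAX_MUTATIONS_PER_SAMPLE):
    """Keep only samples whose mutation count does not exceed the threshold.

    `mutation_counts` maps a (hypothetical) sample identifier to its number
    of somatic mutations.
    """
    return {sample: n for sample, n in mutation_counts.items() if n <= threshold}


# Hypothetical sample identifiers, for illustration only.
counts = {"sample_A": 9_800, "sample_B": 15_000, "sample_C": 178_763}
kept = exclude_hypermutated(counts)
print(sorted(kept))
```

A sample with exactly 15,000 mutations is kept here, following the wording "over 15,000".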

Mutation Impact:

We have also been developing and refining strategies to highlight the most significant mutations. Initially we flagged variants previously identified as SNPs, and added FATHMM scores to predict the relative significance of mutations. These methods have now been superseded by the Cancer Mutation Census (CMC), released in v92, which identifies coding variants that drive different types of cancer. The CMC integrates all coding somatic mutations collected by COSMIC with biological and biochemical information from multiple sources, combining data obtained from manual curation and computational analyses. Metrics like ClinVar significance, dN/dS ratios, and variant frequencies in normal populations (gnomAD) have been integrated into this resource.

Ultimately, we aim to identify the most significant high-value data within cancer genomes, making it much easier to identify actionable biomarkers.

Copy Number Variants (CNV)

For Cancer Genome Project data (including the Cell Lines Project) copy number analysis was carried out using the Affymetrix SNP6.0 array in conjunction with a bespoke algorithm (PICNIC: Predicting Integral Copy Numbers In Cancer).

Where available, copy number data from TCGA and ICGC have been included in COSMIC (for samples already present in the database, i.e. samples with mutations). All TCGA data included in COSMIC have been reanalysed using ASCAT 2.4. For more information, please refer to Peter Van Loo et al., PNAS 107:16910-16915, 2010.

Definition of Minor Allele and Copy Number in tables:

  • Minor Allele: the number of copies of the least frequent allele, e.g. if ABB, minor allele = A (1 copy) and major allele = B (2 copies)
  • Copy Number: the sum of the major and minor allele counts, e.g. if ABB, copy number = 3
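The two definitions above can be illustrated with a short Python sketch. The function name and the genotype-string representation (one letter per allele copy) are assumptions made for this example, and it assumes at most two distinct alleles.

```python
from collections import Counter


def allele_summary(genotype):
    """Summarise a genotype string such as "ABB".

    Assumes at most two distinct alleles; a genotype with a single allele
    (e.g. "BB") is reported with a minor allele count of 0.
    """
    counts = sorted(Counter(genotype).values())
    if len(counts) > 2:
        raise ValueError("expected at most two distinct alleles")
    major = counts[-1] if counts else 0
    minor = counts[0] if len(counts) == 2 else 0
    return {"minor_allele": minor, "major_allele": major, "copy_number": minor + major}


print(allele_summary("ABB"))
# {'minor_allele': 1, 'major_allele': 2, 'copy_number': 3}
```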

Definition of Gain and Loss:

We have introduced filtering thresholds to only display CNVs which are high level amplifications, homozygous deletions, or where there has been 'substantial loss' within an otherwise duplicated genome. We also use a higher threshold for amplification if genome duplication has occurred. We use average ploidy > 2.7 to define genome duplication.

  • ICGC: Gain and Loss as defined in the original data.
  • TCGA (reanalysed with ASCAT 2.4, Peter Van Loo et al., PNAS 107:16910-16915, 2010) and Cell Lines Project (Affymetrix SNP6.0 array data analysed with PICNIC):
    • Gain:
      • average genome ploidy <= 2.7 AND total copy number >= 5
      • OR average genome ploidy > 2.7 AND total copy number >= 9
    • Loss:
      • average genome ploidy <= 2.7 AND total copy number = 0
      • OR average genome ploidy > 2.7 AND total copy number < ( average genome ploidy - 2.7 )
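The TCGA and Cell Lines Project thresholds listed above can be written as a small decision function. This is a minimal sketch: the function and constant names are hypothetical, and it does not cover ICGC data, where Gain and Loss are taken from the original data.

```python
# Average ploidy above this value is treated as genome duplication (from the text).
DUPLICATION_PLOIDY = 2.7


def classify_cnv(average_ploidy, total_copy_number):
    """Return "gain", "loss", or None for a segment, using the listed thresholds."""
    if average_ploidy > DUPLICATION_PLOIDY:
        # Duplicated genome: higher amplification threshold, ploidy-relative loss.
        if total_copy_number >= 9:
            return "gain"
        if total_copy_number < average_ploidy - DUPLICATION_PLOIDY:
            return "loss"
    else:
        # Non-duplicated genome: high-level amplification or homozygous deletion.
        if total_copy_number >= 5:
            return "gain"
        if total_copy_number == 0:
            return "loss"
    return None  # neither threshold met, so the CNV would not be displayed


print(classify_cnv(2.0, 5))   # gain
print(classify_cnv(4.0, 1))   # loss (1 is below 4.0 - 2.7)
print(classify_cnv(4.0, 5))   # None
```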

Gene Expression

Gene expression level 3 data has been downloaded from the publicly accessible TCGA portal. The platform codes currently used to produce the COSMIC gene expression values are: IlluminaHiSeq_RNASeqV2, IlluminaGA_RNASeqV2, IlluminaHiSeq_RNASeq, and IlluminaGA_RNASeq.

Please note that as of COSMIC v71 we no longer show results from the array platforms AgilentG4502A_07_2 and AgilentG4502A_07_3. By using only RNAseq data we can show more results, because disagreement between the array and RNAseq data was quite common and resulted in the exclusion of data (see 'Qualitative merging of results' below).

For the RNASeq platforms, we used the .trimmed.annotated.gene.quantification.txt files, which contain Level 3 expression data. Expression is quantified as RPKM (reads per kilobase of transcript per million mapped reads), which normalizes read counts for transcript length and for the total number of sequencing reads.
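For readers unfamiliar with RPKM, the standard definition can be sketched as follows. This illustrates the general formula only and is not necessarily the exact computation used to produce the TCGA files; the function and parameter names are hypothetical.

```python
def rpkm(reads_mapped_to_gene, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads * 10^9 / (gene length in bp * total mapped reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return reads_mapped_to_gene * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up numbers: 500 reads on a 2,000 bp gene in a library of 10 million mapped reads.
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```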

For the RNASeqV2 platforms, the files used were rsem.genes.normalized_results, which contain Level 3 expression data produced using MapSplice to do the alignment and RSEM to perform the quantitation.


The mean and sample standard deviation (STDEV) of the gene expression values have been calculated from the tumour samples that are diploid for each corresponding gene, platform, and study. Based on these mean and STDEV values we have calculated the standard scores (z-scores) for gene expression for each corresponding gene, platform, and study.
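The calculation described above can be sketched for a single gene, platform and study. The function name, the input layout and the sample identifiers are assumptions made for this example; the mean and sample standard deviation are taken only from samples flagged as diploid for the gene, and every sample is then scored against them.

```python
from statistics import mean, stdev


def expression_z_scores(expression, is_diploid):
    """Standard scores for one gene, platform and study.

    `expression` maps sample id -> expression value.
    `is_diploid` maps sample id -> True if the sample is diploid for this gene.
    Needs at least two diploid samples with differing values.
    """
    reference = [value for sample, value in expression.items() if is_diploid.get(sample)]
    mu = mean(reference)
    sd = stdev(reference)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("standard deviation of the diploid samples is zero")
    return {sample: (value - mu) / sd for sample, value in expression.items()}


# Hypothetical values: s4 is not diploid, so it does not contribute to mean or STDEV.
expression = {"s1": 10.0, "s2": 12.0, "s3": 14.0, "s4": 30.0}
diploid = {"s1": True, "s2": True, "s3": True, "s4": False}
print(expression_z_scores(expression, diploid))
```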

Qualitative merging of results

Results are merged qualitatively per study (project_code) across analysis platforms. To indicate whether a gene is over- or under-expressed, we selected a threshold of plus or minus 2 STDEV. Where a sample has been analysed on more than one platform for a given study and gene, we display 'over' or 'under' only if the scores from all platforms are above or below the threshold. If the platforms do not agree, we do not display a result. The z_score displayed across the website, which serves as an indicative score of expression level, is taken from one platform in the following order of preference: IlluminaHiSeq_RNASeqV2, IlluminaGA_RNASeqV2, IlluminaHiSeq_RNASeq, IlluminaGA_RNASeq.
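The merging rule can be sketched as follows for one sample, gene and study. The function names and the input layout are hypothetical, and treating the threshold as strict (greater than 2, less than -2) is an assumption, since the text does not say how a score of exactly 2 is handled.

```python
# Order of preference for the displayed z_score, as listed in the text.
PLATFORM_PREFERENCE = [
    "IlluminaHiSeq_RNASeqV2",
    "IlluminaGA_RNASeqV2",
    "IlluminaHiSeq_RNASeq",
    "IlluminaGA_RNASeq",
]
THRESHOLD = 2.0  # plus or minus 2 standard deviations


def _call(z_score):
    # Assumption: strict inequalities at the threshold.
    if z_score > THRESHOLD:
        return "over"
    if z_score < -THRESHOLD:
        return "under"
    return None


def merged_status(z_by_platform):
    """Return "over" or "under" only if every platform agrees, else None."""
    calls = {_call(z) for z in z_by_platform.values()}
    return calls.pop() if len(calls) == 1 else None


def displayed_z_score(z_by_platform):
    """Return the z-score from the most preferred platform that has data."""
    for platform in PLATFORM_PREFERENCE:
        if platform in z_by_platform:
            return z_by_platform[platform]
    return None


scores = {"IlluminaGA_RNASeq": 3.1, "IlluminaHiSeq_RNASeqV2": 2.5}
print(merged_status(scores), displayed_z_score(scores))  # over 2.5
```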


We downloaded methylation data for TCGA studies from the ICGC portal that were produced using the Infinium HumanMethylation450 beadchip. Only TCGA studies were downloaded as they include normal samples which are used to predict differential methylation. For the statistical test to be valid only studies with > 19 normal samples were analysed.

GRCh37/Hg19 genomic coordinates were derived from the probe description file from Illumina. We have used hgLiftOver to map these loci on to the new GRCh38/Hg38 genome assembly.

Background (TCGA literature)

LEVEL 3: Derived summary measures (beta values: M/(M+U) for each interrogated locus) with annotations for gene symbol, chromosome (UCSC hg19, Feb 2009), and CpG/CpH coordinate (UCSC hg19, Feb 2009). Probes having a SNP within 10bp of the interrogated CpG site or having 15bp from the interrogated CpG site overlap with a REPEAT element (as defined by RepeatMasker and Tandem Repeat Finder Masks based on UCSC hg19, Feb 2009) are masked as NA across all samples, and probes with a non-detection probability (P-value) greater than 0.05 in a given sample are masked as NA on that chip. Probes that are mapped to multiple sites on hg19 are annotated as NA for chromosome and 0 for CpG/CpH coordinate


The differential methylation analysis was done by comparing the beta-values from tumour and normal populations for each locus (probe/CpG) and each study using the Mann-Whitney test. We then corrected for multiple testing using the Bonferroni correction: the p-value of each locus (CpG) is multiplied by the total number of CpGs in the list, and the locus is considered significant if the corrected p-value is still below the error rate:

Corrected p-value = p-value * n (number of CpGs in the test) < 0.05

In practice this means that a p-value < 0.0000001655 is significant.
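The Bonferroni step can be sketched in plain Python. The function names and the example p-values are made up for illustration, and capping the corrected value at 1.0 is a common convention rather than something stated above. The Mann-Whitney test itself is not reproduced here.

```python
ALPHA = 0.05  # error rate used in the text


def bonferroni(p_values, alpha=ALPHA):
    """Return (corrected p-value, is_significant) for each input p-value.

    Each p-value is multiplied by n, the number of tests (CpGs); the locus is
    significant when the corrected value is below alpha.
    """
    n = len(p_values)
    return [(min(p * n, 1.0), p * n < alpha) for p in p_values]


def per_test_threshold(n, alpha=ALPHA):
    """Equivalent uncorrected cut-off: p is significant when p < alpha / n."""
    return alpha / n


# Made-up p-values for four loci.
for corrected, significant in bonferroni([0.001, 0.02, 0.5, 0.0001]):
    print(corrected, significant)
```

Comparing the raw p-value against `alpha / n` gives the same decision as comparing the corrected p-value against `alpha`, which is how a single fixed cut-off can be quoted for the whole analysis.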

Qualitative Representation of Results

We classify the methylation level as High, Medium or Low (beta-value > 0.8, 0.2-0.8, < 0.2 respectively) and the methylation state as altered (Y or N). For each locus, the state is defined as ‘altered’ when the absolute difference between the average beta value in the normal population and tumour sample is > 0.5. Please note that on the website and in the ftp download file CosmicCompleteDifferentialMethylation we only display results for loci where the p-value < 0.0000001655 and where the methylation level is High or Low and the state is ‘altered’.
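The classification and display rules can be sketched as follows. The function names are hypothetical, and the sketch assumes that the methylation level is assigned from the tumour sample's beta value, which the text does not state explicitly.

```python
P_VALUE_CUTOFF = 0.0000001655  # significance cut-off quoted in the text


def methylation_level(beta):
    """High if beta > 0.8, Low if beta < 0.2, otherwise Medium."""
    if beta > 0.8:
        return "High"
    if beta < 0.2:
        return "Low"
    return "Medium"


def is_altered(normal_mean_beta, tumour_beta):
    """Altered when the absolute difference from the normal average exceeds 0.5."""
    return abs(normal_mean_beta - tumour_beta) > 0.5


def is_displayed(p_value, normal_mean_beta, tumour_beta):
    """Apply the three display conditions: significant, High or Low, and altered."""
    level = methylation_level(tumour_beta)  # assumption: level uses the tumour beta
    return (
        p_value < P_VALUE_CUTOFF
        and level in ("High", "Low")
        and is_altered(normal_mean_beta, tumour_beta)
    )


print(methylation_level(0.9), is_altered(0.1, 0.9), is_displayed(1e-9, 0.1, 0.9))
```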

Mutation Signatures

Details of the analysis performed can be found in Alexandrov L.B. et al., Nature 500(7463):415-21 (2013).

Reanalysis of TCGA data by the Cancer Genome Project (CGP), Sanger Institute

  1. Data were downloaded from CGHub as compressed summary TSV files.
    The following settings were used:
    • By Library Type = WXS
    • By Platform = Illumina
    • State = Live

  2. Remapping was performed using BWA-MEM.
  3. Pindel and CaVEMan were run on the tumour normal pairs.
  4. MAF, VCF and MAGE-TAB files were generated from the Pindel and CaVEMan output.
  5. Mutations from the VCF files were imported into COSMIC.