Cancer Genome Annotation
Exclusion of hypermutated samples and SNP filtering:
Cancer genomes can be a very noisy source of data. It is estimated that an individual's tumour is caused by 5-10 driver mutations, but genome resequencing regularly reveals over 10,000 somatic mutations per tumour, with much larger numbers not unusual in hypermutated samples (we've seen samples with over 100,000 mutations each, the greatest being 178,763).
Across studies from different groups using different techniques, it is unclear whether these huge numbers reflect true hypermutation, substantial germline variation or technical artefacts. To try to improve the value of these data, we are beginning to define a cancer genome noise reduction strategy. Initially, we will exclude any sample with over 15,000 mutations, as this immediately introduces huge noise into COSMIC; these can be reintroduced at a later date.
In addition, we are removing all known SNPs from new genome uploads (initially, these are defined by the 1000 genomes project and a panel of normal (non-cancer) samples from Sanger CGP sequencing). Although these SNPs are excluded from the COSMIC website they are included in the data files available on our SFTP site and can be viewed by switching on the 'SNPs' track in the COSMIC Genome Browser. In the future, we will be assessing how to enhance these filters and best apply them to our curated genes.
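The two filters described above can be sketched in a few lines of Python. This is an illustrative sketch only, not COSMIC's pipeline: the function name, the `(sample_id, variant_key)` tuple layout and the variant key strings are assumptions made for the example, and it assumes the per-sample mutation count is taken before SNPs are removed, which the text does not specify.

```python
from collections import Counter

# Threshold quoted in the text: samples with more than 15,000 mutations are excluded.
MAX_MUTATIONS_PER_SAMPLE = 15_000


def filter_mutations(mutations, known_snps, max_per_sample=MAX_MUTATIONS_PER_SAMPLE):
    """Drop hypermutated samples and known SNPs.

    mutations  : list of (sample_id, variant_key) tuples (hypothetical layout)
    known_snps : set of variant keys treated as known SNPs
    Returns (kept_mutations, excluded_sample_ids).
    """
    # Count mutations per sample before any SNP removal (an assumption).
    counts = Counter(sample for sample, _ in mutations)
    hypermutated = {s for s, n in counts.items() if n > max_per_sample}
    kept = [
        (sample, variant)
        for sample, variant in mutations
        if sample not in hypermutated and variant not in known_snps
    ]
    return kept, hypermutated
```

The excluded sample IDs are returned alongside the kept mutations so that those samples could be reintroduced later, as the text anticipates.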
Ultimately, we aim to identify the most significant high-value data within cancer genomes, making it much easier to identify actionable biomarkers.
The mutation impact filters introduced in COSMIC v73 have been derived from the new FATHMM-MKL algorithm. This algorithm predicts the functional consequences of both coding and non-coding sequence variants using multiple kernel learning (the 'MKL' in its name).
More information about FATHMM-MKL is available in the original publication (doi:10.1093/bioinformatics/btv009).
The new method improves on the older version of FATHMM and now incorporates ENCODE annotation for its prediction. This method is as powerful as CADD scores for coding variants and shows improved prediction for non-coding variants (compared to GWAVA and CADD).
The functional scores for individual mutations from FATHMM-MKL are in the form of a single p-value, ranging from 0 to 1. Scores above 0.5 are predicted to be deleterious, but in order to highlight the most significant data in COSMIC, only scores ≥ 0.7 are classified as 'Pathogenic'. Mutations are classed as 'Neutral' if the score is ≤ 0.5; scores between 0.5 and 0.7 receive neither label. In addition, each functional score is classified into 10 groups of features, depending on whether it is a coding or non-coding variant. Please see the original publication for more details regarding the feature classification (doi:10.1093/bioinformatics/btv009).
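As a minimal sketch of the thresholds just described, the following Python function maps a FATHMM-MKL score to the COSMIC label. The function name is hypothetical, and returning `None` for scores strictly between 0.5 and 0.7 is an assumption of the example; the text names only the 'Pathogenic' and 'Neutral' classes.

```python
PATHOGENIC_MIN = 0.7  # scores >= 0.7 are classed as 'Pathogenic'
NEUTRAL_MAX = 0.5     # scores <= 0.5 are classed as 'Neutral'


def classify_fathmm_mkl(score):
    """Return 'Pathogenic', 'Neutral', or None for an unlabelled score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("FATHMM-MKL scores range from 0 to 1")
    if score >= PATHOGENIC_MIN:
        return "Pathogenic"
    if score <= NEUTRAL_MAX:
        return "Neutral"
    # Above 0.5 but below 0.7: predicted deleterious, but not labelled.
    return None
```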
The following is reproduced from the publication in order to aid interpretation:
Description for each of the feature groups [A-J]
- A. 46-Way Sequence Conservation: based on multiple sequence alignment scores, at the nucleotide level, of 46 vertebrate genomes compared with the human genome.
- B. Histone Modifications (ChIP-Seq): based on ChIP-Seq peak calls for histone modifications.
- C. Transcription Factor Binding Sites (TFBS PeakSeq): based on PeakSeq peak calls for various transcription factors.
- D. Open Chromatin (DNase-Seq): based on DNase-Seq peak calls.
- E. 100-Way Sequence Conservation: based on multiple sequence alignment scores, at the nucleotide level, of 100 vertebrate genomes compared with the human genome.
- F. GC Content: based on a single measure for GC content calculated using a span of five nucleotide bases from the UCSC Genome Browser.
- G. Open Chromatin (FAIRE): based on formaldehyde-assisted isolation of regulatory elements (FAIRE) peak calls.
- H. Transcription Factor Binding Sites (TFBS SPP): based on SPP peak calls for various transcription factors.
- I. Genome Segmentation: based on genome-segmentation states using a consensus merge of segmentations produced by the ChromHMM and Segway software.
- J. Footprints: based on annotations describing DNA footprints across cell types from ENCODE.
Please note: The current FATHMM-MKL algorithm is trained on the Human Gene Mutation Database (HGMD, http://www.hgmd.cf.ac.uk/ac/index.php), which now also contains somatic variants. Results from the currently available version of FATHMM-MKL can be, and have been, used for somatic variants, but users should be aware of the caveats. A cancer-specific version of FATHMM-MKL is under development; when it is available, these scores will be updated.
Copy Number Variants (CNV)
For Cancer Genome Project data (including the Cell Lines Project) copy number analysis was carried out using the Affymetrix SNP6.0 array in conjunction with a bespoke algorithm (PICNIC: Predicting Integral Copy Numbers In Cancer).
Where available, copy number data from TCGA and ICGC have been included in COSMIC (for samples already present in the database, i.e. samples with mutations). All TCGA data included in COSMIC have been reanalysed using ASCAT 2.4. Please refer to Peter Van Loo et al., PNAS 107:16910-16915, 2010, for more information.
Definition of Minor Allele and Copy Number in tables:
- Minor Allele: the number of copies of the least frequent allele, e.g. if ABB, minor allele = A (1 copy) and major allele = B (2 copies)
- Copy Number: the sum of the major and minor allele counts, e.g. if ABB, copy number = 3
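The two definitions above can be illustrated with a short Python sketch. Representing a genotype as a string of 'A' and 'B' characters, and the function name itself, are assumptions made for this example rather than a COSMIC data format.

```python
from collections import Counter


def allele_summary(genotype):
    """Summarise a genotype string such as 'ABB' (hypothetical encoding)."""
    if set(genotype) - {"A", "B"}:
        raise ValueError("genotype may contain only 'A' and 'B'")
    counts = Counter(genotype)
    a, b = counts["A"], counts["B"]
    return {
        "minor_allele": min(a, b),  # copies of the least frequent allele
        "major_allele": max(a, b),  # copies of the most frequent allele
        "copy_number": a + b,       # sum of major and minor allele counts
    }
```

For the ABB example in the text, `allele_summary("ABB")` gives a minor allele count of 1, a major allele count of 2 and a copy number of 3.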
Definition of Gain and Loss:
We have introduced filtering thresholds to only display CNVs which are high level amplifications, homozygous deletions, or where there has been 'substantial loss' within an otherwise duplicated genome. We also use a higher threshold for amplification if genome duplication has occurred. We use average ploidy > 2.7 to define genome duplication.
- ICGC: Gain and Loss as defined in the original data.
- TCGA (reanalysed with ASCAT 2.4; Peter Van Loo et al., PNAS 107:16910-16915, 2010) and Cell Lines Project (Affymetrix SNP6.0 array data analysed with PICNIC):
Gain:
- average genome ploidy <= 2.7 AND total copy number >= 5
- OR average genome ploidy > 2.7 AND total copy number >= 9
Loss:
- average genome ploidy <= 2.7 AND total copy number = 0
- OR average genome ploidy > 2.7 AND total copy number < ( average genome ploidy - 2.7 )
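The thresholds for TCGA and Cell Lines Project data can be written as a small Python function. This sketch reads the first pair of conditions (total copy number >= 5 or >= 9) as the gain rule and the second pair as the loss rule, which matches the description of high level amplifications and homozygous deletions; the function name and the `None` return for segments that pass neither filter are assumptions of the example.

```python
PLOIDY_CUTOFF = 2.7  # average ploidy above this defines genome duplication


def classify_cnv(total_copy_number, average_ploidy):
    """Return 'gain', 'loss', or None if the segment passes neither filter."""
    if average_ploidy > PLOIDY_CUTOFF:
        # Duplicated genome: higher amplification threshold,
        # and 'substantial loss' is measured relative to ploidy.
        if total_copy_number >= 9:
            return "gain"
        if total_copy_number < average_ploidy - PLOIDY_CUTOFF:
            return "loss"
    else:
        if total_copy_number >= 5:
            return "gain"
        if total_copy_number == 0:
            return "loss"
    return None
```

For example, a total copy number of 1 counts as a loss in a genome with average ploidy 4.0 (1 < 4.0 - 2.7) but not in one with average ploidy 3.5.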
Gene expression level 3 data has been downloaded from the publicly accessible TCGA portal. The platform codes currently used to produce the COSMIC gene expression values are: IlluminaHiSeq_RNASeqV2, IlluminaGA_RNASeqV2, IlluminaHiSeq_RNASeq, and IlluminaGA_RNASeq.
Please note that as of COSMIC v71 we no longer show results from the array platforms AgilentG4502A_07_2 and AgilentG4502A_07_3. Using only RNAseq data lets us show more results, because disagreement between the array and RNAseq data was quite common and resulted in the exclusion of data (see 'Qualitative merging of results' below).
For the RNASeq platforms we used the .trimmed.annotated.gene.quantification.txt files, which contain Level 3 expression data. Expression is quantified as RPKM (reads per kilobase per million mapped reads), which normalizes for transcript length and the total number of sequencing reads. [https://wiki.nci.nih.gov/display/TCGA/RNASeq]
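For readers unfamiliar with RPKM, the standard definition (reads per kilobase of transcript per million mapped reads) can be sketched as below. This is a generic illustration of the formula, not the TCGA pipeline code; the function and parameter names are chosen for the example, and the Level 3 files already contain the computed values.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)
```

A gene of 2,000 bp with 1,000 mapped reads in a library of 10 million mapped reads has an RPKM of 50.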
For the RNASeqV2 platforms, the files used were rsem.genes.normalized_results, which contain Level 3 expression data produced using MapSplice to do the alignment and RSEM to perform the quantitation. [https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2]
Qualitative merging of results
We downloaded methylation data for TCGA studies from the ICGC portal that were produced using the Infinium HumanMethylation450 BeadChip. Only TCGA studies were downloaded, as they include normal samples, which are used to predict differential methylation. For the statistical test to be valid, only studies with > 19 normal samples were analysed.
GRCh37/Hg19 genomic coordinates were derived from the probe description file from Illumina. We have used hgLiftOver to map these loci onto the new GRCh38/Hg38 genome assembly.
Background (TCGA literature)
Then we corrected for multiple testing using the Bonferroni correction as follows:
The p-value of each locus (CpG) is multiplied by the total number of CpGs in the list.
If the corrected p-value is still below the error rate, the locus is considered significant:
Corrected p-value = p-value * n < 0.05, where n is the number of CpGs in the test
In practice this means that a p-value < 0.0000001655 is significant.
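The Bonferroni step can be sketched in Python as follows. The function names are illustrative, and capping corrected p-values at 1.0 is a common convention added here rather than something the text states.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni-correct a list of per-CpG p-values.

    Returns (corrected_p_values, significant_flags).
    """
    n = len(p_values)
    # Multiply each p-value by the number of tests; cap at 1.0 (a convention).
    corrected = [min(1.0, p * n) for p in p_values]
    # A locus is significant if its corrected p-value is below the error rate.
    significant = [p * n < alpha for p in p_values]
    return corrected, significant


def per_test_threshold(n_tests, alpha=0.05):
    """Equivalent uncorrected p-value cutoff for n_tests comparisons."""
    return alpha / n_tests
```

The quoted cutoff of 0.0000001655 corresponds to roughly 302,000 CpGs, since 0.05 / 0.0000001655 is about 302,115; that count is back-calculated here and is not a figure given in the text.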
Qualitative Representation of Results
Details of the analysis performed can be found in Alexandrov L.B. et al., Nature 500(7463):415-21 (2013)
Reanalysis of TCGA data by the Cancer Genome Project (CGP), Sanger Institute
- Data were downloaded from CGHub as compressed summary TSV files.
The following settings were used:
- By Library Type = WXS
- By Platform = Illumina
- State = Live
- Remapping was performed using BWA-MEM.
- Pindel and CaVEMan were run on the tumour-normal pairs.
- MAF, VCF and MAGE-TAB files were generated from the Pindel and CaVEMan output.
- Mutations from the VCF files were imported into COSMIC.