Cancer Genome Annotation


For every cell line in the COSMIC Cell Lines Project database we have released VCF files that list all variants identified by our variant calling algorithms (Pindel and Caveman). Variants presented in the VCF files have been screened by a series of post-processing filters, which flag likely false positive calls based on criteria such as read position, read/base quality, and whether the variant was seen in a small set of normal samples (n=60). Variants that fail one or more of these filters are annotated in the ‘filter’ column of the VCF file (a description of each filter is given in the header). Only variants that are marked ‘PASS’ in the filter column are taken forward.

Additional filters (described below) are then applied to these ‘PASS’ variants to remove germline or low confidence variants prior to entry into the COSMIC database. Therefore, if a variant is classed as ‘PASS’ in the VCF file but is not shown in the COSMIC database, it failed one of these downstream filters.
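As a rough illustration of the first step, the sketch below keeps only the VCF data lines whose FILTER column is ‘PASS’. It assumes a standard tab-separated VCF layout (FILTER is the seventh column); the function name and the example filter name `LOWQUAL` are hypothetical and are not taken from the released files.

```python
def passed_variants(vcf_lines):
    """Yield the fields of VCF data lines whose FILTER column is exactly 'PASS'."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.split("\t")
        # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO
        if len(fields) >= 7 and fields[6] == "PASS":
            yield fields


# Illustrative input only; 'LOWQUAL' is a made-up filter name.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG\t.\tPASS\t.",
    "1\t22222\t.\tC\tT\t.\tLOWQUAL\t.",
]
kept = list(passed_variants(example))
```

Variants kept by this step would still be subject to the downstream germline and confidence filters before appearing in the database.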

1. Germline filter

Additional germline variants were identified and excluded by comparison to ~8000 normal datasets sourced from:

  • 1000 Genomes (released March 29th 2012) - variants with a frequency > 0.0014 have been removed.
  • ESP6500 (released June 20th 2012) - variants with a frequency >= 0.00025 have been removed.
  • dbSNP (Ensembl 58) - SNPs that have a recorded minor allele frequency have been removed.
  • In-house normal set (n=350) - variants seen in more than 1 normal have been removed.
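The germline thresholds listed above can be sketched as a single predicate. This is only an illustration under stated assumptions: the `PopulationEvidence` record and its field names are hypothetical, and a missing frequency (`None`) is assumed to mean the variant was not observed in that dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PopulationEvidence:
    """Hypothetical per-variant summary of the normal datasets."""
    freq_1000g: Optional[float] = None    # frequency in 1000 Genomes, if seen
    freq_esp6500: Optional[float] = None  # frequency in ESP6500, if seen
    dbsnp_has_maf: bool = False           # dbSNP entry with a minor allele frequency
    inhouse_normal_count: int = 0         # number of in-house normals carrying it


def is_likely_germline(ev: PopulationEvidence) -> bool:
    """Return True if any of the listed germline criteria is met."""
    if ev.freq_1000g is not None and ev.freq_1000g > 0.0014:
        return True
    if ev.freq_esp6500 is not None and ev.freq_esp6500 >= 0.00025:
        return True
    if ev.dbsnp_has_maf:
        return True
    return ev.inhouse_normal_count > 1
```

Note the different comparison operators: the 1000 Genomes threshold is strict (>), while the ESP6500 threshold is inclusive (>=), as stated in the list.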

2. Confidence filter

High confidence variants are those that meet all of the following criteria, which were shown to give a >85% likelihood of the variant being real:

  • Read depth ≥ 15
  • Mutant allele burden ≥ 15%
  • Variant not seen at any level in the reference normal samples used by the variant calling algorithms.

The variants that fail these 'confidence criteria' are only entered into the COSMIC Cell Lines Project database if validated by an independent experiment or study.
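The confidence criteria can be expressed as a short check. In this sketch the mutant allele burden is assumed to be the fraction of reads supporting the mutant allele (mutant reads divided by read depth); that definition, and all function and parameter names, are assumptions made for illustration.

```python
def is_high_confidence(read_depth, mutant_reads, seen_in_reference_normal):
    """Apply the three confidence criteria to one variant."""
    if read_depth < 15:
        return False
    # Burden >= 15%, compared with integers to avoid floating-point rounding.
    burden_ok = mutant_reads * 100 >= 15 * read_depth
    return burden_ok and not seen_in_reference_normal


def enters_database(read_depth, mutant_reads, seen_in_reference_normal,
                    independently_validated=False):
    """Low confidence variants are kept only if independently validated."""
    return (is_high_confidence(read_depth, mutant_reads, seen_in_reference_normal)
            or independently_validated)
```

For example, a variant with 3 mutant reads out of 20 sits exactly on both thresholds and is treated as high confidence here, provided it is absent from the reference normal samples.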

User defined filtering

Over time we have added filters on the website which allow users to select those variants within the cell lines that are more likely to contribute to carcinogenesis.

These include:

  • Variants in genes known to contribute to cancer and therefore present in the Cancer Gene Census
  • Variants within the cell lines that are similar to variants seen recurrently in whole genome screened tumour samples.
  • Mutation impact on the protein as determined by FATHMM.


Recurrence is defined by counting whole genome screened tumour samples, according to the mutation type:

  Substitutions:     ≥ 3 samples with a missense substitution in the same codon
  Inframe Indels:   ≥ 3 samples with an inframe indel in the same codon
  Terminations:      > 10 samples with a mutation causing premature protein termination
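The recurrence thresholds above could be applied as follows. This is a minimal sketch: the mutation type labels are hypothetical strings, and `n_samples` is assumed to be the already-computed count of whole genome screened tumour samples carrying a qualifying mutation.

```python
def is_recurrent(mutation_type, n_samples):
    """Return True if the sample count meets the threshold for this type."""
    if mutation_type in ("substitution", "inframe_indel"):
        return n_samples >= 3   # at least 3 samples, same codon
    if mutation_type == "termination":
        return n_samples > 10   # strictly more than 10 samples
    raise ValueError(f"unknown mutation type: {mutation_type!r}")
```

The termination threshold is strict, so exactly 10 samples does not count as recurrent under this reading.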

Mutation Impact

The mutation impact filters introduced in COSMIC v73 have been derived from the new FATHMM-MKL algorithm. This algorithm predicts the functional, molecular and phenotypic consequences of protein missense variants using hidden Markov models.

More information about FATHMM-MKL is available here

The new method improves on the older version of FATHMM and now incorporates ENCODE annotation for its prediction. This method is as powerful as CADD scores for coding variants and shows improved prediction for non-coding variants (compared to GWAVA and CADD).

The functional scores for individual mutations from FATHMM-MKL are in the form of a single p-value, ranging from 0 to 1. Scores above 0.5 are predicted to be deleterious, but in order to highlight the most significant data in COSMIC, only scores ≥ 0.7 are classified as 'Pathogenic'. Mutations are classed as 'Neutral' if the score is ≤ 0.5. Scores between 0.5 and 0.7 therefore receive neither label. In addition, each functional score is classified into 10 groups of features, depending on whether it is a coding or non-coding variant. Please see the original publication for more details regarding the feature classification (doi:10.1093/bioinformatics/btv009).
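The score thresholds can be summarised in a few lines of code. This is a sketch only; the function name is hypothetical, and returning `None` for scores strictly between 0.5 and 0.7 is an assumption, since the text assigns neither label in that range.

```python
def classify_fathmm_mkl(score):
    """Map a FATHMM-MKL score in [0, 1] to the COSMIC label, if any."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score >= 0.7:
        return "Pathogenic"
    if score <= 0.5:
        return "Neutral"
    return None  # above 0.5 but below 0.7: no label given in the text
```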

The following is reproduced from the publication in order to aid interpretation:

Description for each of the feature groups [A-J]

  • A. 46-Way Sequence Conservation: based on multiple sequence alignment scores, at the nucleotide level, of 46 vertebrate genomes compared with the human genome.
  • B. Histone Modifications (ChIP-Seq): based on ChIP-Seq peak calls for histone modifications.
  • C. Transcription Factor Binding Sites (TFBS PeakSeq): based on PeakSeq peak calls for various transcription factors.
  • D. Open Chromatin (DNase-Seq): based on DNase-Seq peak calls.
  • E. 100-Way Sequence Conservation: based on multiple sequence alignment scores, at the nucleotide level, of 100 vertebrate genomes compared with the human genome.
  • F. GC Content: based on a single measure for GC content calculated using a span of five nucleotide bases from the UCSC Genome Browser.
  • G. Open Chromatin (FAIRE): based on formaldehyde-assisted isolation of regulatory elements (FAIRE) peak calls.
  • H. Transcription Factor Binding Sites (TFBS SPP): based on SPP peak calls for various transcription factors.
  • I. Genome Segmentation: based on genome-segmentation states using a consensus merge of segmentations produced by the ChromHMM and Segway software.
  • J. Footprints: based on annotations describing DNA footprints across cell types from ENCODE.

Please note: The current FATHMM-MKL algorithm is trained on the Human Gene Mutation Database (HGMD), which now also contains somatic variants. Results from the currently available version of FATHMM-MKL can be, and have been, used for somatic variants, but the user should be aware of the caveats. A cancer-specific version of FATHMM-MKL is under development, and these scores will be updated when it becomes available.

Copy Number Variants (CNV)

For the Cell Lines Project copy number analysis was carried out using the Affymetrix SNP6.0 array in conjunction with a bespoke algorithm (PICNIC: Predicting Integral Copy Numbers In Cancer).

Definition of Minor Allele and Copy Number in tables:

  • Minor Allele: the number of copies of the least frequent allele, e.g. if ABB, minor allele = A (1 copy) and major allele = B (2 copies)
  • Copy Number: the sum of the major and minor allele counts, e.g. if ABB, copy number = 3
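The two definitions can be checked against the ABB example with a small helper. The string encoding of the genotype (one letter per allele copy, using only A and B) is an assumption made for illustration and is not a format used in the tables.

```python
def allele_summary(genotype):
    """Return (minor_allele_copies, total_copy_number) for a string like 'ABB'."""
    a_copies = genotype.count("A")
    b_copies = genotype.count("B")
    minor = min(a_copies, b_copies)   # copies of the least frequent allele
    total = a_copies + b_copies       # major + minor allele counts
    return minor, total
```

Under this encoding, a segment with only one allele present (for example BB) has a minor allele count of 0.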

Definition of Gain and Loss:

We have introduced filtering thresholds to only display CNVs which are high level amplifications, homozygous deletions, or where there has been 'substantial loss' within an otherwise duplicated genome. We also use a higher threshold for amplification if genome duplication has occurred. We use average ploidy > 2.7 to define genome duplication.

  • Gain:
    • average genome ploidy <= 2.7 AND total copy number >= 5
    • OR average genome ploidy > 2.7 AND total copy number >= 9
  • Loss:
    • average genome ploidy <= 2.7 AND total copy number = 0
    • OR average genome ploidy > 2.7 AND total copy number < ( average genome ploidy - 2.7 )
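The gain and loss rules above translate directly into a small classifier. This is a sketch under the stated thresholds; the function name, the constant name, and the `None` return value for segments that are neither gain nor loss are assumptions.

```python
GENOME_DUPLICATION_PLOIDY = 2.7  # average ploidy above this means genome duplication


def classify_cnv(average_ploidy, total_copy_number):
    """Return 'gain', 'loss', or None for a copy number segment."""
    if average_ploidy > GENOME_DUPLICATION_PLOIDY:
        # Genome-duplicated sample: higher amplification threshold,
        # and 'substantial loss' relative to the average ploidy.
        if total_copy_number >= 9:
            return "gain"
        if total_copy_number < average_ploidy - GENOME_DUPLICATION_PLOIDY:
            return "loss"
    else:
        if total_copy_number >= 5:
            return "gain"
        if total_copy_number == 0:
            return "loss"
    return None  # segment would not be displayed
```

For instance, with an average ploidy of 4.0 the loss cut-off is 4.0 - 2.7 = 1.3, so a total copy number of 1 counts as a loss while a copy number of 2 does not.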

Gene Expression

The platform used was the Affymetrix Human Genome U219 Array.

The array probe level data were normalised using the RMA method (Robust Multi-array Average). See Irizarry et al. (2003), "Exploration, normalization, and summaries of high density oligonucleotide array probe level data", for a description of the methodology.

The mean and sample standard deviation of the gene expression values have been calculated from the Tumour samples that are diploid for each corresponding gene.

Based on these mean and STDEV values we have calculated the standard score (the expression value minus the mean, divided by the STDEV) for each corresponding gene. To indicate whether a gene is over- or under-expressed, a threshold of plus or minus 2 STDEV was selected.
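The standard score calculation can be sketched as below, using the sample standard deviation of the reference values for one gene. The function names are hypothetical, and treating the threshold as strict (beyond 2 STDEV rather than at or beyond) is an assumption, since the text does not say whether the boundary is inclusive.

```python
from statistics import mean, stdev


def expression_z_score(value, diploid_reference_values):
    """Standard score of one expression value against the diploid tumour samples."""
    mu = mean(diploid_reference_values)
    sd = stdev(diploid_reference_values)  # sample standard deviation (n - 1)
    return (value - mu) / sd


def expression_call(z_score, threshold=2.0):
    """Label a gene as over, under, or normally expressed."""
    if z_score > threshold:
        return "over"
    if z_score < -threshold:
        return "under"
    return "normal"
```

`statistics.stdev` needs at least two reference values, and a gene with identical values across all reference samples would give a zero standard deviation, so real data would need a guard for that case.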