What is The Cell Lines Project?
For decades, human immortal cancer cell lines have constituted an accessible, easily usable set of biological models with which to investigate cancer biology and to explore the potential efficacy of anticancer drugs. However, cancer cell lines have been subject to criticism because they may represent a highly selected subgroup of the cancer classes from which they have been derived and may have acquired additional genetic abnormalities in vitro. Moreover, certain cancer cell lines are known to have contaminated others and thus the provenance of a cancer cell line is not always clear.
In order to improve their utility the Cancer Genome Project has embarked on a systematic characterisation of the genetics and genomics of large numbers of cancer cell lines. Prior knowledge of their genetic abnormalities may allow more informed choice of cancer cell lines in biological experiments and drug testing and more informed interpretation of results.
In release 67 there was a major update to the COSMIC Cell Lines Project. The capillary sequencing data of 800 cell lines across 64 genes was archived (available to download here). In its place, the project and website now present data from the full exome sequencing of 1015 cancer cell lines.
These cancer cell lines are from major publicly accessible repositories from around the world, together with a few lines that are not publicly available. This set has been designed to encompass a broad range of tumour types and includes most cell lines that have been used extensively in cancer research, including the NCI-60 set. All the cell lines in the set have been ‘fingerprinted’ using a panel of 94 SNPs (single nucleotide polymorphisms) and 16 STRs (short tandem repeats) and are genetically unique. Where available, the STR profiles have been matched to those published by the cell line repositories (see the QC.xlsx file).
Exome sequence was obtained for each cell line using the ‘SureSelectXT Human All Exon 50Mb’ bait set, sequenced on either the Illumina GAII or HiSeq DNA sequencers with 75 bp paired-end reads.
Variants were called using the Caveman [1] and Pindel [2] algorithms, which call base substitutions and small insertions/deletions (indels) respectively. This data was filtered such that only variants deemed of good quality are presented on the COSMIC Cell Lines Project website (described below). However, the full list of variants called by these algorithms can be accessed as VCF files via the ‘Overview’ tab on the Sample page, together with a file called ‘read_depth_data.xlxs’ that gives ‘bait’, ‘exon’ and ‘gene’ level read depth information. Additional data is also presented in the ‘Overview’ tab for the cell lines, including genome-wide copy number analysis and genotyping information obtained using the Affymetrix SNP6 array and analysed using the PICNIC algorithm. The BAM files and SNP6 .cel files can be accessed via the European Genome-phenome Archive Study: EGAS00001000978. Expression data derived from duplicate assays using the Affymetrix U219 expression array is due for release on 1st July 2015 at ArrayExpress, accession number E-MTAB-3610.
Other intermediate scale sequencing projects have also been carried out across some of the cell lines within the set: the Cancer Cell Line Encyclopedia (CCLE) [5,6] and sequencing of the NCI-60 set by the NCI [7]. Comparison of the data across these studies identified a large degree of overlap between the different datasets. However, because different algorithms were used to call variants, some differences between the sets do occur.
Variants seen across multiple datasets or those checked by capillary sequencing (including the previous version of the ‘Cell Lines Project’, archived here) have been scored as ‘validated’.
For every cell line in the COSMIC Cell Lines Project database we have released VCF files that list all variants identified by our variant calling algorithms (Pindel and Caveman). Variants presented in the VCF files have been screened by a series of post-processing filters, which flag likely false positive calls based on criteria such as read position, read/base quality and whether the variant was seen in a small set of normal samples (n=60). Variants that fail one or more of these filters are annotated in the ‘filter’ column of the VCF file (a description of each filter is given in the header). Only variants that are marked ‘PASS’ in the filter column are taken forward.
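As an illustration of the 'PASS' step described above, the following is a minimal sketch that reads VCF text and keeps only the records whose FILTER column (the seventh tab-separated column in the VCF format) is 'PASS'. The function name, the example records and the filter label `FILTER_X` are hypothetical and are not taken from the project's own pipeline.

```python
def passing_variants(vcf_lines):
    """Yield the tab-split fields of each record whose FILTER column is PASS."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.split("\t")
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        if fields[6] == "PASS":
            yield fields


# Hypothetical example records; FILTER_X stands in for any failed filter.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tT\t.\tPASS\t.",
    "1\t2000\t.\tG\tC\t.\tFILTER_X\t.",
]
kept = list(passing_variants(example))
print([fields[1] for fields in kept])
```

A real file would be passed in as an open file handle, for example `passing_variants(open(path))`, since the function only needs an iterable of lines.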
Additional filters are then applied to these ‘PASS’ed variants to remove germline or low confidence variants (described below) prior to entry to the COSMIC database. Therefore, if a variant is classed as ‘PASS’ in the VCF file but is not shown in the COSMIC database, it is because it failed one of these downstream filters.
1. Germline filter
Additional germline variants were identified and excluded by comparison to ~8000 normal datasets sourced from:
- 1000 Genomes (released March 29th 2012) [4] - variants with a frequency > 0.0014 have been removed.
- ESP6500 (released June 20th 2012) [3] - variants with a frequency >= 0.00025 have been removed.
- dbSNP (Ensembl 58) - SNPs that have a recorded minor allele frequency have been removed.
- In-house normal set (n=350) - variants seen in more than 1 normal have been removed.
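The four germline rules listed above can be sketched as a single predicate. This is only an illustration under stated assumptions: the `NormalEvidence` record and its field names are hypothetical, and how the frequencies are looked up in each resource is not described in this document. Only the thresholds come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NormalEvidence:
    """Hypothetical summary of what the normal datasets say about one variant."""
    kg_freq: Optional[float] = None   # frequency in 1000 Genomes, None if absent
    esp_freq: Optional[float] = None  # frequency in ESP6500, None if absent
    dbsnp_has_maf: bool = False       # dbSNP entry carries a minor allele frequency
    inhouse_normals: int = 0          # number of in-house normals with the variant


def is_probable_germline(ev: NormalEvidence) -> bool:
    """Return True if any of the four germline rules would remove the variant."""
    if ev.kg_freq is not None and ev.kg_freq > 0.0014:
        return True
    if ev.esp_freq is not None and ev.esp_freq >= 0.00025:
        return True
    if ev.dbsnp_has_maf:
        return True
    return ev.inhouse_normals > 1  # seen in more than 1 in-house normal


print(is_probable_germline(NormalEvidence(esp_freq=0.00025)))
print(is_probable_germline(NormalEvidence(inhouse_normals=1)))
```

Note the different comparison operators: the 1000 Genomes rule is a strict inequality, whereas the ESP6500 rule includes the threshold value itself.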
High confidence variants were classed as those which passed the following criteria as these were shown to have a >85% likelihood of being real:
- Read depth ≥ 15
- Mutant allele burden ≥ 15%
- Variant not seen at any level in the reference normal samples used by the variant calling algorithms.
The variants that fail these 'confidence criteria' are only entered into the COSMIC Cell Lines Project database if validated by an independent experiment or study.
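The confidence criteria and the validation exception can be sketched as follows. This is a rough illustration, not the project's implementation: the function names are hypothetical, and the mutant allele burden is assumed here to be the fraction of reads at the position that carry the variant.

```python
def is_high_confidence(read_depth, mutant_reads, seen_in_reference_normals):
    """Apply the three confidence criteria to one variant call.

    Assumption: mutant allele burden = mutant_reads / read_depth.
    """
    if read_depth < 15:
        return False
    if seen_in_reference_normals:
        return False
    # Integer form of mutant_reads / read_depth >= 15%, avoiding float rounding.
    return mutant_reads * 100 >= 15 * read_depth


def enters_database(read_depth, mutant_reads, seen_in_reference_normals,
                    validated=False):
    """Low confidence variants are kept only if independently validated."""
    return validated or is_high_confidence(
        read_depth, mutant_reads, seen_in_reference_normals
    )


print(is_high_confidence(20, 3, False))               # 3/20 is exactly 15%
print(enters_database(10, 1, False, validated=True))  # rescued by validation
```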
User defined filtering
Over time we have added filters on the website which allow users to select those variants within the cell lines that are more likely to contribute to carcinogenesis.
These include:
- Variants in genes known to contribute to cancer and therefore present in the Cancer Gene Census
- Variants within the cell lines that are similar to variants seen recurrently in whole genome screened tumour samples.
- Mutation impact on the protein as determined by FATHMM.
Recurrence is defined by counting whole genome screened tumour samples, according to the mutation type:
- Substitutions: ≥ 3 samples with a missense substitution in the same codon
- Inframe Indels: ≥ 3 samples with an inframe indel in the same codon
- Terminations: > 10 samples with a mutation causing premature protein termination
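The recurrence thresholds can be sketched as a small lookup plus a per-codon sample count. The record layout, the mutation type labels and the gene name `GENE_A` are hypothetical. The text above does not state how termination mutations are grouped before counting, so the sketch leaves that grouping to the caller and only applies the threshold.

```python
from collections import defaultdict


def is_recurrent(mutation_type, n_samples):
    """Apply the recurrence threshold for one mutation type."""
    if mutation_type in ("missense", "inframe_indel"):
        return n_samples >= 3
    if mutation_type == "termination":
        return n_samples > 10
    return False


def samples_per_codon(records):
    """Count distinct samples per (gene, codon).

    records: iterable of (sample_id, gene, codon) tuples (hypothetical layout).
    """
    seen = defaultdict(set)
    for sample_id, gene, codon in records:
        seen[(gene, codon)].add(sample_id)
    return {site: len(samples) for site, samples in seen.items()}


# Hypothetical missense calls; sample S3 is listed twice but counted once.
records = [
    ("S1", "GENE_A", 12), ("S2", "GENE_A", 12),
    ("S3", "GENE_A", 12), ("S3", "GENE_A", 12),
    ("S1", "GENE_A", 13),
]
counts = samples_per_codon(records)
recurrent = sorted(s for s, n in counts.items() if is_recurrent("missense", n))
print(recurrent)
```

Counting distinct samples rather than rows prevents a sample that is reported more than once from inflating the recurrence count.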
The mutation impact filters introduced in COSMIC v73 have been derived from the new FATHMM-MKL algorithm. This algorithm predicts the functional, molecular and phenotypic consequences of protein missense variants using hidden Markov models.
More information about FATHMM-MKL is available here
The new method improves on the older version of FATHMM and now incorporates ENCODE annotation for its prediction. This method is as powerful as CADD scores for coding variants and shows improved prediction for non-coding variants (compared to GWAVA and CADD).
The functional scores for individual mutations from FATHMM-MKL are in the form of a single p-value, ranging from 0 to 1. Scores above 0.5 are deleterious, but in order to highlight the most significant data in COSMIC, only scores ≥ 0.7 are classified as 'Pathogenic'. Mutations are classed as 'Neutral' if the score is ≤ 0.5. In addition, each functional score is classified into 10 groups of features, depending on whether it is a coding or non-coding variant. Please see the original publication for more details regarding the feature classification (doi:10.1093/bioinformatics/btv009).
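The score thresholds described above can be expressed as a short function. This is a minimal sketch: the function name is hypothetical, and returning `None` for scores strictly between 0.5 and 0.7 is an assumption, since the text assigns no label to that range.

```python
def classify_fathmm_mkl(score):
    """Map a FATHMM-MKL functional score (0 to 1) to the label described above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score >= 0.7:
        return "Pathogenic"
    if score <= 0.5:
        return "Neutral"
    # Above the 0.5 deleterious cut-off but below 0.7: no label is assigned.
    return None


for s in (0.95, 0.7, 0.6, 0.5, 0.1):
    print(s, classify_fathmm_mkl(s))
```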
The following is reproduced from the publication in order to aid interpretation:
Description for each of the feature groups [A-J]
- A. 46-Way Sequence Conservation: based on multiple sequence alignment scores, at the nucleotide level, of 46 vertebrate genomes compared with the human genome.
- B. Histone Modifications (ChIP-Seq): based on ChIP-Seq peak calls for histone modifications.
- C. Transcription Factor Binding Sites (TFBS PeakSeq): based on PeakSeq peak calls for various transcription factors.
- D. Open Chromatin (DNase-Seq): based on DNase-Seq peak calls.
- E. 100-Way Sequence Conservation: based on multiple sequence alignment scores, at the nucleotide level, of 100 vertebrate genomes compared with the human genome.
- F. GC Content: based on a single measure for GC content calculated using a span of five nucleotide bases from the UCSC Genome Browser.
- G. Open Chromatin (FAIRE): based on formaldehyde-assisted isolation of regulatory elements (FAIRE) peak calls.
- H. Transcription Factor Binding Sites (TFBS SPP): based on SPP peak calls for various transcription factors.
- I. Genome Segmentation: based on genome-segmentation states using a consensus merge of segmentations produced by the ChromHMM and Segway software.
- J. Footprints: based on annotations describing DNA footprints across cell types from ENCODE.
Please note: The current FATHMM-MKL algorithm is trained on the Human Gene Mutation Database (HGMD, http://www.hgmd.cf.ac.uk/ac/index.php), which now also contains somatic variants. Results from the currently available version of FATHMM-MKL can be, and have been, used for somatic variants, but the user should be aware of the caveats. A cancer-specific version of FATHMM-MKL is under development, and these scores will be updated when it becomes available.
- Papaemmanuil E, et al (2011). N Engl J Med 365(15):1384-1395.
- Ye K, Schulz MH, Long Q, Apweiler R, & Ning Z (2009). Bioinformatics 25(21):2865-2871.
- Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA (URL: http://evs.gs.washington.edu/EVS/) [Accessed 13/11/2012].
- Barretina J, et al (2012). Nature 483(7391):603-607.
- Abaan OD, et al (2013). Cancer Res 73(14):4372-4382.