The dark genome: new sources of cancer proteins?

The human genome is conventionally divided into the “coding” genome, which comprises the ~20,000 annotated human protein-coding genes, and the “dark” genome, which does not encode proteins.  The dark genome is vast, accounting for the ~98.5% of the genome where repeat elements, enhancers, other regulatory sequences, and non-coding RNAs reside.

Historically, a major effort of human genomics has been to define the complement of human protein-coding genes, which then serve as the basis for biomedical discovery and research.  With the development of positional cloning techniques, the 1980s saw the emergence of gene cloning as a major field of research.  Major human disease genes such as von Willebrand factor1 and the cystic fibrosis gene CFTR2,3 were cloned in this way.  In the 1990s, high-throughput analyses of complementary DNA (cDNA) provided early global insights into human gene structure.  Methods such as expressed sequence tag (EST) sequencing4 and serial analysis of gene expression (SAGE)5 revolutionized the ability to detect genes and understand exon structure based on spliced mRNAs.  Such approaches led to the belief that there were ~35,000–100,000 human protein-coding genes6,7.

The publication of the Human Genome Project8 (HGP) in 2001 was the culmination of these efforts.  The HGP simultaneously expanded the number of annotated human protein-coding genes and dramatically reduced the estimated total, from ~100,000 to ~20,000.  That estimate has remained at ~20,000 ever since.

With the advent of next-generation sequencing methods, Ingolia et al. developed a method to sequence genome-wide footprints of ribosome engagement, termed Ribo-seq9.  Beyond confirming the translation of annotated proteins, this method has revolutionized genomic understanding by revealing thousands of sites of ribosome translation in the “dark” genome.  The central question, therefore, is how to interpret these data: biological noise? new proteins? a faulty technique?

We have viewed these data as an opportunity to explore the limits of the functional genome.  Ribo-seq analyses have nominated thousands of putative sites of translation across the genome10,11.  In our study, published online in Nature Biotechnology12, we set out to test many of these Ribo-seq ORFs (open reading frames) for function in human cancer cells.  We performed three major assays:

  1. Assessment of whether the ORF is translated into a stable protein
  2. Assessment of its ability to change cellular state, as measured by gene expression changes
  3. Assessment of whether it is required for maintenance of cancer cell growth

Surprisingly, of 553 tested Ribo-seq ORFs, we found that ~50% produced a stably detected protein, ~30% altered cellular gene expression, and ~10% were required for cancer cell viability.

These results suggest that there may be a sizable collection of functional peptides present in human diseases that are not captured by the ~20,000 annotated protein coding genes.  While further study is needed to establish whether additional groups of Ribo-seq ORFs will be similarly enriched for potential functions, we are motivated by the possibility that the human genome may harbor a new layer of unexplored biology.  Our data provide motivation to continue to search the “dark” genome for new sources of human biology that may be relevant in complex diseases such as cancer.


1. Ginsburg, D. et al. Human von Willebrand factor (vWF): isolation of complementary DNA (cDNA) clones and chromosomal localization. Science 228, 1401–1406 (1985).

2. Riordan, J. R. et al. Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science 245, 1066–1073 (1989).

3. Kerem, B. et al. Identification of the cystic fibrosis gene: genetic analysis. Science 245, 1073–1080 (1989).

4. Adams, M. D. et al. Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252, 1651–1656 (1991).

5. Velculescu, V. E., Zhang, L., Vogelstein, B. & Kinzler, K. W. Serial analysis of gene expression. Science 270, 484–487 (1995).

6. Ewing, B. & Green, P. Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 25, 232–234 (2000).

7. Liang, F. et al. Gene Index analysis of the human genome estimates approximately 120,000 genes. Nat. Genet. 25, 239–240 (2000).

8. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

9. Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. S. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009).

10. Ji, Z., Song, R., Regev, A. & Struhl, K. Many lncRNAs, 5’UTRs, and pseudogenes are translated and some are likely to express functional proteins. eLife 4, e08890 (2015).

11. Martinez, T. F. et al. Accurate annotation of human protein-coding small open reading frames. Nat. Chem. Biol. 16, 458–468 (2020).

12. Prensner, J. R. et al. Non-canonical open reading frames encode functional proteins essential for cancer cell survival. Nat. Biotechnol. (2021).
