RBP-eCLIP Peak Annotation | Bioinformatics eBlog

The goal of RBP-eCLIP is to identify where an RNA-binding protein (RBP) is binding; these regions are often called peaks due to their mountain-like appearance on a genome browser. After peaks have been called, it is important to determine what genes and gene features are associated with those sites. For example, an RBP binding to an intron of a gene likely has a different regulatory role than a different RBP that binds to the 3’ UTR of the same gene. Determining how to annotate an RBP peak is a non-trivial task, and in this post we will walk you through how we perform this analysis.

The first step in annotating peaks is to obtain a high-quality reference for where different genes are in the genome. At Eclipsebio we often use GENCODE or Ensembl references, both of which have been developed over many years to contain well-validated gene models. For each gene, we identify the regions that correspond to different features, such as untranslated regions (UTRs), introns, and coding sequences (CDS). A diagram of a basic gene model is shown in Figure 1, and the features we examine are explained in Table 1.
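As a rough illustration of how intronic regions can be derived from a gene model, the sketch below computes introns as the gaps between consecutive exons of a single transcript. The function name, the 0-based half-open coordinate convention, and the example coordinates are assumptions made for this illustration; this is not a description of the Eclipsebio pipeline.

```python
from typing import List, Tuple

# Assumption: intervals are 0-based and half-open, i.e. [start, end).
Interval = Tuple[int, int]


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons)
    introns = []
    # Walk over neighbouring exon pairs and keep the space between them.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:
            introns.append((prev_end, next_start))
    return introns


# Hypothetical three-exon transcript.
exons = [(100, 200), (500, 650), (1000, 1200)]
print(introns_from_exons(exons))
```

In practice the exon coordinates would come from the GENCODE or Ensembl annotation file rather than being typed in by hand.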

Figure 1: Gene features used during peak annotation

| Region | Definition | Potential Function |
| --- | --- | --- |
| Coding sequence (CDS) | Sequence of an mRNA that codes for a protein | mRNA stability (2); Translation efficiency (2,3); Splicing (3) |
| 5’ UTR | Untranslated region between the transcription start site and start codon | Transcript initiation (3); Translation regulation (3) |
| 3’ UTR | Untranslated region between the stop codon and transcription end site | mRNA stability (5); miRNA regulation (4); Translation efficiency (6) |
| Splice site | Intronic region that is within the first 100 (5’) or last 100 (3’) nt of an intron | Splicing (1) |
| Proximal intron | Intronic region that is between 100 and 500 nt from the nearest exon | Splicing (1); Localization (1) |
| Distal intron | Intronic region that is more than 500 nt from the nearest exon | Transcription regulation (3); Pre-mRNA processing (3) |
Table 1: Definitions and potential functions of different gene features
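
The three intronic categories in Table 1 amount to a rule based on distance to the nearest exon. The sketch below applies that rule to a single position inside an intron. The function name, the coordinate convention, the handling of positions exactly at the 100 and 500 nt boundaries, and the idea of reducing a peak to one representative position (for example its midpoint) are all assumptions for illustration, not a description of our pipeline.

```python
def classify_intronic_position(pos: int, intron_start: int, intron_end: int,
                               strand: str = "+") -> str:
    """Label a position inside an intron using the Table 1 distance cutoffs.

    Assumption: coordinates are 0-based and the intron is half-open,
    i.e. it covers [intron_start, intron_end).
    """
    if not intron_start <= pos < intron_end:
        raise ValueError("position is outside the intron")

    dist_left = pos - intron_start        # nt from the lower-coordinate exon
    dist_right = intron_end - 1 - pos     # nt from the higher-coordinate exon
    dist = min(dist_left, dist_right)

    if dist < 100:
        # The 5' end of the intron is on the left for '+' strand genes
        # and on the right for '-' strand genes.
        left_is_closer = dist_left <= dist_right
        five_prime = left_is_closer if strand == "+" else not left_is_closer
        return "5' splice site" if five_prime else "3' splice site"
    if dist < 500:
        return "proximal intron"
    return "distal intron"


# Hypothetical 2,000 nt intron on the '+' strand.
for position in (1050, 1300, 2000, 2990):
    print(position, classify_intronic_position(position, 1000, 3000))
```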


Although one could perform a simple overlap to identify which peaks are associated with which gene features, this can produce ambiguous assignments because genes often overlap. A given site may fall in the CDS of one gene and in an intron of another. To resolve this, we have developed a hierarchy for peak assignment in which regions are prioritized in the order CDS > UTR > introns > non-coding exons > non-coding introns (1).
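
To make the hierarchy concrete, here is a minimal sketch that picks a single annotation for a peak from all of the features it overlaps. The feature class names, the `assign_peak` function, the gene identifiers, and the tie-breaking behavior (the first listed overlap wins among equal-priority hits) are assumptions for illustration; only the priority order itself comes from the text above.

```python
from typing import List, Optional, Tuple

# Priority order from the text: CDS > UTR > introns >
# non-coding exons > non-coding introns. Labels are hypothetical.
PRIORITY = ["CDS", "UTR", "intron", "noncoding_exon", "noncoding_intron"]
RANK = {name: index for index, name in enumerate(PRIORITY)}


def assign_peak(overlaps: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Choose one (gene_id, feature_class) pair for a peak.

    `overlaps` lists every feature the peak overlaps. Returns None when
    the peak overlaps no annotated feature.
    """
    if not overlaps:
        return None
    # min() keeps the first item among ties, so input order breaks ties.
    return min(overlaps, key=lambda hit: RANK[hit[1]])


# A peak in an intron of GENE_B that also overlaps the CDS of GENE_A.
hits = [("GENE_B", "intron"), ("GENE_A", "CDS")]
print(assign_peak(hits))
```

Here the peak is assigned to the CDS of the hypothetical GENE_A, because CDS outranks introns in the hierarchy.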

It can be challenging to set up an accurate annotation of eCLIP peaks. Luckily, Eclipsebio is here to help. Our analysis pipelines provide robust labeling of peaks across different sample types and genomes. Contact us today about how we can help you achieve your research goals.

References:
1. Van Nostrand et al. (2020)
2. Grzybowska and Wakula (2021)
3. Van Nostrand et al. (2020)
4. Plass, Rasmussen, and Krogh (2017)
5. Mayya and Duchaine (2019)
6. Szostak and Gebauer (2012)
