A2 Variant Calling

From Geuvadis MediaWiki
Jump to: navigation, search


Variant calling at known RNA editing sites

by T. Wieland

Since multisample variant calling for all possible sites is still going on, we decided to also call the genotypes at a set of known RNA editing sites separately. The list of known sites was taken from the UCSC RNA Editing track which uses information from a database called DARNED and includes 42,039 positions.

Variant calling

We performed multisample calling using SAMtools across all available 462 Samples that weren't excluded by QC steps. For performance reasons we divided the editing sites per chromosome and ran the calling for the chromosomes in parallel. The call for chr1 looked like this:

samtools \
mpileup -C0 -m3 -F0.0002 \
-E \
-l chr1.darned.bed \
-d999999 \
-q20 \
-DSuf hg19.fa \
-b inputBams \
| bcftools view -cgv  \
-l chr1.darned.bed \
 - 2> chr1mpileup.log | vcfutils.pl varFilter -Q25 -d3 -D4999500 -a2 -w10 -W10 -10.0001 -21e-400 -30 -40.0001 -p > chr1not_filtered.vcf 2> chr1filtered.vcf

As you can see, the called variants were directly piped into the varFilter script that comes with SAMtools and filters for several parameters like minimum/maximum read depth, Hardy-Weinberg Equilibrium, ... A short description of the parameters, including standard values, can be obtained by running the script without any parameters:

Usage:   vcfutils.pl varFilter [options] <in.vcf>

Options: -Q INT    minimum RMS mapping quality for SNPs [10]
         -d INT    minimum read depth [2]
         -D INT    maximum read depth [10000000]
         -a INT    minimum number of alternate bases [2]
         -w INT    SNP within INT bp around a gap to be filtered [3]
         -W INT    window size for filtering adjacent gaps [10]
         -1 FLOAT  min P-value for strand bias (given PV4) [0.0001]
         -2 FLOAT  min P-value for baseQ bias [1e-100]
         -3 FLOAT  min P-value for mapQ bias [0]
         -4 FLOAT  min P-value for end distance bias [0.0001]
	 -e FLOAT  min P-value for HWE (plus F<0) [0.0001]
         -p        print filtered variants

Note: Some of the filters rely on annotations generated by SAMtools/BCFtools. 

The filtered variants were not discarded immediately, but only marked as filtered in the VCF file and kept for further analysis. Altogether SAMtools called non-reference variants (i.e. at least one sample had a non-reference genotype) at 24,680 sites. Please Note: SAMtools calls non-reference genotypes also at sites without sufficient coverage, sometimes even at sites without reads at all, due to some kind of "local imputation" (See this paper for details).


To reduce the number false positive RNA editing events we applied a set of very stringent filters:

  • Minimum median coverage: We required a minimum median coverage of 10 at called sites.
  • Minimum number of non-reference genotypes: At least 10 samples had to have a non-reference genotype at each site.
  • varFilter passed: We required a variant to not be filtered out by the varFilter script.
  • Minimum variant quality: We required a minimum variant quality (assigned by SAMtools) of 100
  • No genomic variant: To ensure that the observed variants are true RNA editing events and not already present at the DNA level, we required two things:
    • There should not be a known genetic variant at this position, i.e. there should not be a corresponding variant in the Genetic variation data.
    • The reason that there is no corresponding variant in the genotype files might not be that there is no variant, but simply that it was not possible to call it due to bad coverage or quality of the region. The 1000 Genome Project offers files that state which regions of the human genome are accessible to current sequencing techniques by looking for regions in their data that fulfill certain quality criteria. Here we required the putative RNA editing events to lie within a region of the strict mask, because we assumed that if a variant was no RNA editing event but a DNA variant it would show up in the genotype files.

Final variant sets

After applying these filters only 113 variants remained. Currently there are three output files that can be found on the FTP server:

  • final.rna_editing.vcf: The final set of variants. It includes the number of reference/non-reference reads per sample in the DP4 field in the sample columns.
  • final.rna_editing.stats.txt: The sum of samples with a non-reference genotype per site.
  • final.rna_editing.proportions.txt: The proportion of non-reference/reference reads per sample per site.

Using ANOVA, we discovered 11 population specific RNA editing events within these sites. Boxplots showing the different proportions per population can be found here: File:Proportions boxplots.pdf.

2012-04-30 Variant calling from RNA-Seq identifies RNA Editing

by M.Sammeth

We employed the SNAPE pipeline (developed at the CNAG by E.Raineri and F.Castro) to call variants from five RNA-Seq samples that have been sequenced by all laboratories of the Geuvadis project. On average, we observe >30,000 heterogenous sites that coincide with SNPs predicted by complementary (exomic) DNA sequencing in the respective individual. Interestingly, we also found 59 RNA-polymorph sites that confirm RNA-editing events reported earlier to change protein-encoding sequences. Corresponding polymorphisms called from RNA-Seq are absent in the DNA-Seq and coincide the correct base substitutions (mainly A>G by the ADAR mechanism, observed as T>C for genes that are tanscribed from the reverse strand.

Most of these(36 events in the genes shown below) are high-fidelity calls (median quality >200 compared to <100 for low confidence calls) confirmed in at least 4/5 samples. These observations indicate that most RNA editing does not occur randomly distributed across individuals, but maintains a characteristic pattern. Most of the minority events stem from poorly expressed loci, where the ratio between sequencing depth and technical variability determines the success of variant calling.

Distribution and Confidence of RNA Editing Calls

2012-04-30 MS known RNA-edits.png

36 high-confidence RNA-edit events

The events below were confirmed by >80% of the samples (at least 4 out of 5 samples in the Geuvadis sandbox dataset). They have been reported earlier to change the coding sequence of the corresponding gene (cf. references below).

NEIL1 [1]; KCNAB1 [2]; HEXA, KLHL, the niban-like protein FAM129C, MECR, C5orf45, SMYD4, ZFYVE16, SYTL1, MAN2B2, SLC25A45, CPSF3L, BLCAP, NEIL1, ZBP1, GM2A, ZNF638, QPRT, NASP, SLC12A8, SYTL1, GLTSCR2, NCOA6, ZNF623, GPN2, FOXRED2, NDUFC2, GLTSCR2, CIITA, RNF207, GM2A, RPAP1, ZNF555, ASB16, CIITA, OTUD6B [3]; ZNF638, OTUD6B [4]; BLCAP [5]


[1] RNA editing changes the lesion specifity for the DNA repair enzyme NEIL1 (2010) Yeo J, Goodman RA, Schirle NT, David SS, Peter AB. PNAS
[2] Nervous system targets of RNA editing identified by comparative genomics (2003) Hoopengardner B, Bhalla T, Staber C, Reenan R. Science
[3] Widespread RNA Editing of Embedded AluElements in the Human Transcriptome (2004) Kim DDY, Kim TTY, Walsh T, Kobayashi Y, Matise TC, Buyske S, Gabriel A. Genome Research
[4] Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing (2009) Li JB, Levanon EY, Yoon JK, Aach J, Xie B, Leproust E, Zhang K, Gao Y, Church GM. Science
[5] A bioinformatic screen for novel A-I RNA editing sites reveals recoding editing in BC10 (2005) Clutterbuck DR, Leroy A, O’Connell MA, Semple CA. Nucl. Acids Res
Personal tools

RNAseq Data and Analysis
Admin and info