A2 Variant Calling
Variant calling at known RNA editing sites
by T. Wieland
Since multisample variant calling for all possible sites is still going on, we decided to also call the genotypes at a set of known RNA editing sites separately. The list of known sites was taken from the UCSC RNA Editing track which uses information from a database called DARNED and includes 42,039 positions.
We performed multisample calling using SAMtools across all available 462 Samples that weren't excluded by QC steps. For performance reasons we divided the editing sites per chromosome and ran the calling for the chromosomes in parallel. The call for chr1 looked like this:
samtools \ mpileup -C0 -m3 -F0.0002 \ -E \ -l chr1.darned.bed \ -d999999 \ -q20 \ -DSuf hg19.fa \ -b inputBams \ | bcftools view -cgv \ -l chr1.darned.bed \ - 2> chr1mpileup.log | vcfutils.pl varFilter -Q25 -d3 -D4999500 -a2 -w10 -W10 -10.0001 -21e-400 -30 -40.0001 -p > chr1not_filtered.vcf 2> chr1filtered.vcf
As you can see, the called variants were directly piped into the
varFilter script that comes with SAMtools and filters for several parameters like minimum/maximum read depth, Hardy-Weinberg Equilibrium, ... A short description of the parameters, including standard values, can be obtained by running the script without any parameters:
Usage: vcfutils.pl varFilter [options] <in.vcf> Options: -Q INT minimum RMS mapping quality for SNPs  -d INT minimum read depth  -D INT maximum read depth  -a INT minimum number of alternate bases  -w INT SNP within INT bp around a gap to be filtered  -W INT window size for filtering adjacent gaps  -1 FLOAT min P-value for strand bias (given PV4) [0.0001] -2 FLOAT min P-value for baseQ bias [1e-100] -3 FLOAT min P-value for mapQ bias  -4 FLOAT min P-value for end distance bias [0.0001] -e FLOAT min P-value for HWE (plus F<0) [0.0001] -p print filtered variants Note: Some of the filters rely on annotations generated by SAMtools/BCFtools.
The filtered variants were not discarded immediately, but only marked as filtered in the VCF file and kept for further analysis. Altogether SAMtools called non-reference variants (i.e. at least one sample had a non-reference genotype) at 24,680 sites. Please Note: SAMtools calls non-reference genotypes also at sites without sufficient coverage, sometimes even at sites without reads at all, due to some kind of "local imputation" (See this paper for details).
To reduce the number false positive RNA editing events we applied a set of very stringent filters:
- Minimum median coverage: We required a minimum median coverage of 10 at called sites.
- Minimum number of non-reference genotypes: At least 10 samples had to have a non-reference genotype at each site.
- varFilter passed: We required a variant to not be filtered out by the
- Minimum variant quality: We required a minimum variant quality (assigned by SAMtools) of 100
- No genomic variant: To ensure that the observed variants are true RNA editing events and not already present at the DNA level, we required two things:
- There should not be a known genetic variant at this position, i.e. there should not be a corresponding variant in the Genetic variation data.
- The reason that there is no corresponding variant in the genotype files might not be that there is no variant, but simply that it was not possible to call it due to bad coverage or quality of the region. The 1000 Genome Project offers files that state which regions of the human genome are accessible to current sequencing techniques by looking for regions in their data that fulfill certain quality criteria. Here we required the putative RNA editing events to lie within a region of the strict mask, because we assumed that if a variant was no RNA editing event but a DNA variant it would show up in the genotype files.
Final variant sets
After applying these filters only 113 variants remained. Currently there are three output files that can be found on the FTP server:
- final.rna_editing.vcf: The final set of variants. It includes the number of reference/non-reference reads per sample in the DP4 field in the sample columns.
- final.rna_editing.stats.txt: The sum of samples with a non-reference genotype per site.
- final.rna_editing.proportions.txt: The proportion of non-reference/reference reads per sample per site.
Using ANOVA, we discovered 11 population specific RNA editing events within these sites. Boxplots showing the different proportions per population can be found here: File:Proportions boxplots.pdf.
2012-04-30 Variant calling from RNA-Seq identifies RNA Editing
We employed the SNAPE pipeline (developed at the CNAG by E.Raineri and F.Castro) to call variants from five RNA-Seq samples that have been sequenced by all laboratories of the Geuvadis project. On average, we observe >30,000 heterogenous sites that coincide with SNPs predicted by complementary (exomic) DNA sequencing in the respective individual. Interestingly, we also found 59 RNA-polymorph sites that confirm RNA-editing events reported earlier to change protein-encoding sequences. Corresponding polymorphisms called from RNA-Seq are absent in the DNA-Seq and coincide the correct base substitutions (mainly A>G by the ADAR mechanism, observed as T>C for genes that are tanscribed from the reverse strand.
Most of these(36 events in the genes shown below) are high-fidelity calls (median quality >200 compared to <100 for low confidence calls) confirmed in at least 4/5 samples. These observations indicate that most RNA editing does not occur randomly distributed across individuals, but maintains a characteristic pattern. Most of the minority events stem from poorly expressed loci, where the ratio between sequencing depth and technical variability determines the success of variant calling.
Distribution and Confidence of RNA Editing Calls
36 high-confidence RNA-edit events
The events below were confirmed by >80% of the samples (at least 4 out of 5 samples in the Geuvadis sandbox dataset). They have been reported earlier to change the coding sequence of the corresponding gene (cf. references below).
NEIL1 ; KCNAB1 ; HEXA, KLHL, the niban-like protein FAM129C, MECR, C5orf45, SMYD4, ZFYVE16, SYTL1, MAN2B2, SLC25A45, CPSF3L, BLCAP, NEIL1, ZBP1, GM2A, ZNF638, QPRT, NASP, SLC12A8, SYTL1, GLTSCR2, NCOA6, ZNF623, GPN2, FOXRED2, NDUFC2, GLTSCR2, CIITA, RNF207, GM2A, RPAP1, ZNF555, ASB16, CIITA, OTUD6B ; ZNF638, OTUD6B ; BLCAP 
 RNA editing changes the lesion specifity for the DNA repair enzyme NEIL1 (2010) Yeo J, Goodman RA, Schirle NT, David SS, Peter AB. PNAS
 Nervous system targets of RNA editing identified by comparative genomics (2003) Hoopengardner B, Bhalla T, Staber C, Reenan R. Science
 Widespread RNA Editing of Embedded AluElements in the Human Transcriptome (2004) Kim DDY, Kim TTY, Walsh T, Kobayashi Y, Matise TC, Buyske S, Gabriel A. Genome Research
 Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing (2009) Li JB, Levanon EY, Yoon JK, Aach J, Xie B, Leproust E, Zhang K, Gao Y, Church GM. Science
 A bioinformatic screen for novel A-I RNA editing sites reveals recoding editing in BC10 (2005) Clutterbuck DR, Leroy A, O’Connell MA, Semple CA. Nucl. Acids Res