Basic miRNA analyses

From Geuvadis MediaWiki
Jump to: navigation, search

This is a brief description at of the miRNA analyses which can be downloaded from ftp://ftp-private.ebi.ac.uk/upload/geuvadis/wp4_rnaseq/main_project/analysis_data/mirna/. The analyses were performed by the CRG SMARTAR (small RNA transcriptome analyzer) pipeline.



PHRED_SCORES


Description:


The PHRED quality scores of the first 36 nucleotides of each dataset. The scores are calculated for every nucleotide position in every data set (min, mean, max). These quality scores are presented either in table format or graphical format.

Method:


To ensure that all sRNA datasets are comparable, datasets with read lengths longer than 36 nts were trimmed using the FASTX suite from the Greg Hannon lab (http://hannonlab.cshl.edu/fastx_toolkit/), using the following command: fastx_trimmer -f 1 -l 36 -Q33. Tables and graphics were generated with the command: fastq_quality_boxplot_graph.sh -p.




FASTQ_CLIPPED


Description:


These are the FASTQ files after quality filtering and adapter clipping.

Method:


Reads were trimmed to length 36 nts as in PHRED_SCORES section. Trimmed homo-polymer reads were removed with this command: fastx_artifacts_filter -Q33 and remaining reads with low PHRED scores removed with this line: fastq_quality_filter -q 10 -p 50 -Q33. Reads were removed by identifying perfect matched to the first 8 nucleotides of the 3' adapter 'TGGAATTC'. This search starts at position 18 in the read. If there are no matches to the first 8 nt, then matches to the first 7 nt of the adapter are searched in the last seven nt of the read, then matches of the first six to the last six positions and so on. When a match is first found, the match to the adapter sequence and all nucleotides downstream are clipped from the read, and the next read is searched. Reads that have no matches are retained, but not clipped.




FASTA_COLLAPSED


Description:


These are the FASTA files after quality filtering and adapter clipping. Further all identical reads have been collapsed, constituting single FASTA entries. Each FASTA identifier contains three fields separated by underscores '_'. The first field is a three-character tag which unique identifies the sample. The second is a running number to ensure that each entry in a sample is unique. The last field, following an 'x' indicates how many times the sequence was detected in the sample.

Method:


Reads were trimmed and homo-polymer reads and low-quality reads removed as in FASTQ_CLIPPED. Adapters were clipped using the AdRec.jar program from the seqBuster suite (http://sourceforge.net/apps/mediawiki/seqbuster/) with the following options: java -jar AdRec.jar 1 8 0.3. If there are no matches to the first 8 nt, then matches to the first 7 nt of the adapter are searched in the last seven nt of the read, then matches of the first six to the last six positions and so on. When a match is first found, the match to the adapter sequence and all nucleotides downstream are clipped from the read, and the next read is searched. Reads that have no matches are retained, but not clipped. Last, reads shorter than 18 nts were discarded.





CROSS_SPECIES_CONTAMINATION


Description:


For each dataset, reads corresponding to clade-specific miRNAs were identified. Clade-specific miRNAs are defined as those which are specific to one of nine clades: sponges, nematode, insects, lophotrochozoa, echinoderms, fish, birds/reptiles, rodents or primates. In the table format, the following information is given. In the line with the clade name, the number before the slash '/' indicates the number of clade-specific miRNA families identified in the dataset. The number after the slash indicates the total number of known clade-specific miRNA families for the clade. The last number indicates the summed number of reads in the data corresponding the miRNA families specific to the clade. The following lines with two-column format gives the name of a clade-specific miRNA family (first number) and the number of reads corresponding to this family (second number). The pie-chart graphics gives breakdown of clade-specific miRNAs by clade. Importantly there was no evidence of cross-species contaminations in our data :)


Method:


A custom file with clade-specific miRNAs was generated (available upon demand). A given read was noted as corresponding to a clade-specific miRNA if the first twenty nucleotides were identical using custom source (available upon demand). The FASTA_COLLAPSED files were used for this analysis.




LENGTH_PROFILES


Description:


Tables and graphics indicating for each dataset how many reads were detected of the given length, after quality filtering and adapter clipping. The tables are in two-column format, with the first column indicating the length in nucleotides and the second indicating the read count. The figures supply the same information with length plotted vs. the count. Note that some reads which are indicated as being 35 nts might reflect molecules which are 36 nts or longer, this is a possible artifact of the adapter clipping (see FASTA_COLLAPSED section).

Method:


The FASTA_COLLAPSED files were used for generating length profiles, using custom source (available upon demand).




ANNOTATION_MIRS


Description:


These are the improved annotations of mature and hairpin precursor miRNAs, in FASTA and GTF format.


Method:


The annotation build upon miRBase version 18 annotations (http://www.mirbase.org/) but with significant improvements. In the cases where only one miRNA strand was annotated, the position and sequence of the other strand was estimated using RNA structure prediction. This analysis was performed by Henk Buermans from Leiden. Further, for the mature and hairpin miRNAs which overlapped structural variants, all variant sequences were generated. This analysis was performed by Tuuli Lappalainen from Unigen. Note that for a small number of miRNA strands, unambigous annotation was not possible.




EXPRESSION_MIRS


Description:


The miRBase_improved_counts.csv file supplies read counts for every mature miRNA and its variants over all datasets. Note that reads which map equally well to two or more miRNAs are counted fully towards each miRNA. The miRBase_improved_tpm.csv supplies the same information, but using a simple transcripts per million (TPM) normalization. That is, the read counts in each dataset is normalized to one million to facilitate comparisons between datasets of different sequencing depths. Of course, more complicated normalizations are possible and should be tested.


Method:


miRNA read counts were calculated using miraligner.jar from the seqBuster suite (http://sourceforge.net/apps/mediawiki/seqbuster/) using the following options: java -jar miraligner.jar 1 3 1. The FASTA_COLLAPSED files were used as were the improved miRNA annotations (see ANNOTATION_MIRS section).




READ_ANNOTATION_BREAKDOWN_DETAILED


Description:


Annotation breakdown of the FASTA_COLLAPSED sequencing reads in table format and graphics format. Annotations are ENCODE 8 (http://www.gencodegenes.org/) supplemented with rRNA and LINE and Alu transposon annotations from RepBase (http://www.girinst.org/repbase/) and snoRNA and miRNA annotations from UCSC table browser (http://genome.ucsc.eud/cgi-bin/hgTables). In addition, reads mapping to the mitochondrial genome are designated 'mitochondrion' and reads mapping to the genome of a known human viral pathogen (like EBV) are designated 'virus'. Reads mapping to the unassembled part of the human genome are considered 'rRNA'. Introns of protein coding genes were designated 'intron_coding' while introns of non-coding genes were designated 'intron_non_coding'. Last, all reads mapping to un-annotated positions, but with annotations on the same positions on the opposite strand were designated '_as'. Reads with no annotation were designated 'intergenic'.


Method:


Annotations were first collapsed so that each nucleotide on each strand had exactly one annotation. In case of nucleotides with more than one annotation, conflicts were resolved using a confidence-based floating hierarchy (as in Berninger et al., Computational analysis of small RNA cloning data. Methods 44, 13-21 (2008)). The hierarchy is: mitochondrion, virus, miRNA > snoRNA > rRNA > tRNA > snRNA > misc_RNA > lincRNA > processed_transcript > polymorphic_pseudogene > pseudogene > protein_coding > misc_RNA_pseudogene > scRNA_pseudogene > snRNA_pseudogene > L1 > Alu > intron_coding > intron_non_coding > _as > intergenic. The FASTA_COLLAPSED reads were mapped against the human genome (hg19) concatenated with the unassembled parts of the human genome and genomes of known human viral pathogens using the following option: bowtie -f -v 1 -a --best --strata. Each read mapping was weighted inversely to the number of genome mappings for the read, e.g. a read mapping to two genomic locations would get an assigned weight of 0.5. The weight of each read mapping was then counted towards the annotation of the nucleotide in the middle of the read mapping. For each annotation type, the assigned weight in a given dataset is then the sum of read mappings over all the nucleotides in the genome which the annotation occupies.




READ_ANNOTATION_BREAKDOWN


Description:


Annotation breakdown of the sequencing reads in table format and graphics format, as in READ_ANNOTATION_BREAKDOWN_DETAILED. However, a reduced number of 13 distinct annotations are used: mitochondrion, virus, miRNA, snoRNA, rRNA, tRNA, miscRNA, lncRNA, pseudogene, protein_coding, transposon, intergenic, intron. Further, in this annotation there is no distinction between the two genomic strands, e.g. a read mapping to antisense to a miRNA gene will be considered as a miRNA read.


Method:


As in READ_ANNOTATION_BREAKDOWN_DETAILED, but a reduced set of annotations were used. Specifically annotations were fused in this way: mitochondrion -> mitochondrion, virus -> virus, miRNA -> miRNA, snoRNA -> snoRNA, rRNA -> rRNA, tRNA -> tRNA, ( miscRNA, scRNA, snRNA ) -> miscRNA, ( lincRNA, processed_transcripts ) -> lncRNA, ( misc_RNA_pseudogene, polymorphic_pseudogenes, pseudogenes, scRNA_pseudogene, snRNA_pseudogene ) -> pseudogenes, protein_coding -> protein_coding, ( Alu, L1 ) -> transposon, ( intron_coding, intron_non_coding ) -> intron, intergenic -> intergenic. Further, each nucleotide can have only one annotation, so the hierarchy was used to resolve conflicts between annotations on opposite strands of a given nucleotide position.




EXPRESSION_FEATURES


Description:


Profiles expression of individual gene features, from protein coding exons to structural RNAs like snoRNAs, snRNAs and scRNAs. Every gene is tagged by a running number to keep the id unique. The 'expression_features.csv' spread-sheet supplies expression values across all samples, the 'annotation_features.gtf' file supplies hg19 coordinates of each gene feature. Note that this function of the pipeline has not been tested, and so expression values should be treated with some care. In particular, expression values for protein coding and lncRNA exons and miRNAs should not be used, as the ones produced by FLUX CAPACITOR and SMARTAR respectively are certainly more accurate. For structural RNAs, it is not straightforward how to normalize between samples. For instance, for normalizing expression of a given tRNA, would it be best to normalize to the number of mapped reads in each sample, or to the number of reads mapped to tRNAs in each sample? Probably orthogonal methods like qPCR should be applied to resolve such questions. We are testing this in the Estivill lab, but results will probably not be ready for some months.


Method:


The method used is the same as in READ_ANNOTATION_BREAKDOWN detailed, except that every that every gene feature is treated as an independent type of annotation. Only intergenic and intron features are not treated as individual types of annotation. Custom source is available upon demand.




BEDGRAPH


Description:


Custom tracks of read intensities over the genome. Can be uploaded to a genome browser. Intensities for the two genomic strands are separate. 'watson' for the '+' strand and 'crick' for the '-' strand. Intensities are capped at 100 for visualization purposes.


Method:


The FASTA_COLLAPSED reads were mapped to the hg19 genome as described in the READ_ANNOTATION_BREAKDOWN_DETAILED section. Each read mapping was weighted inversely to the number of genome mappings, e.g. a read mapping to two genomic locations would get an assigned weight of 0.5. Each nucleotide in the genome was then assigned a transcriptional intensity equal to the sum of weighted read mappings overlapping with it. This was done with custom source (available upon demand).




MAPPINGS_HAIRPINS


Description:


The FASTQ_CLIPPED reads mapped against the improved hairpin miRNA sequences (see ANNOTATIONS_MIR). Only unique mappings are considered. These mappings were generated for ASE analyses.


Method:


The FASTQ_CLIPPED reads were mapped against the hairpin miRNA sequences using this command line: bowtie -v 1 -a -m 1 --best --strata --norc --sam. The SAM mappings were parsed to BAM with this command line: samtools view -b -h -F 4 -S.




MAPPINGS_GENOME


Description:


The FASTQ_CLIPPED reads mapped against the hg19 genome. Only unique mappings are considered. It is not completely clear to me if these mappings will be useful in our analyses, probably researchers in our consortium will prefer to generate their own genome mappings using either FASTA_COLLAPSED or FASTQ_CLIPPED.


Method:


The FASTQ_CLIPPED reads were mapped against the hg19 genome using this command line: bowtie -v 1 -a -m 1 --best --strata --norc --sam. The SAM mappings were parsed to BAM with this command line: samtools view -b -h -F 4 -S.




STATISTICS


Description:


Table summarizing the number of reads in each data set before filtering, how many reads are quality filtered (homo-polymers and low PHRED scores), filtered by length (<18 nts), how many passed the filtering but were not successfully mapped (see READ_ANNOTATION_BREAKDOWN_DETAILED for mapping procedure) and how many reads were successfully mapped to the genome.


Method:


Custom source was used to sum up reads (available upon demand). The FASTA_COLLAPSED reads were used for these statistics.

Personal tools
Namespaces

Variants
Actions
Navigation
RNAseq Data and Analysis
Admin and info
Public
Toolbox