Installation
To get the code, please see: Download
To set up rDiff, please follow these steps:
- Download SAMtools (version 0.1.7) from http://samtools.sourceforge.net/ and install it. You need to add the flag -fPIC in the SAMtools Makefile for compilation.
- Change to the directory that contains rDiff and configure rDiff by typing:
./configure
This will set the environment variables for the rDiff installation. You can also set these variables manually using the interactive configuration:
./configure -i
- To compile rDiff and finish the installation, type:
make
- To test the installation and run an example, type:
make example
This will download an example dataset and run rDiff.parametric on it. You can also run rDiff.parametric, rDiff.nonparametric and rDiff.poisson by typing:
make threeexamples
Examples
rDiff can be used in various experimental settings.
- Detecting differential relative transcript abundance when gene annotation is complete
- Detecting differential relative transcript abundance when gene annotation is incomplete
- Working without replicates
Using rDiff.parametric
When the gene structure is known, we recommend using rDiff.parametric. This statistical test checks for differences in the relative abundance of annotated transcripts. rDiff.parametric requires as input bam files for each sample, as well as a GFF3 gene structure. In the following example we test for differences between the two samples "1" and "2", which have the replicates bam11.bam, bam12.bam and bam21.bam, bam22.bam, respectively. In our example we assume that the bam files are located in the directory bamdir and that the reads are 75 bp long. Furthermore, we assume that our gene structure is saved in the file genes.gff3 in GFF3 format.
The test can then be started by first changing into the directory bin:
cd bin
and then typing:
./rdiff -o outdir -d bamdir -a bam11.bam,bam12.bam -b bam21.bam,bam22.bam -g genes.gff3 -m param -L 75 -M 30
Here we additionally require that a read be at least 30 bp long in order to be included in the analysis. A detailed description of the parameters used can be found in the following table:
Option | Description |
---|---|
-o | The output directory for the results |
-d | The directory where the bam files are |
-a | The filenames of the bam files for the first sample. The filenames must be separated by "," and without spaces. |
-b | The filenames of the bam files for the second sample. The filenames must be separated by "," and without spaces. |
-g | The filename of the gene structure. The filename must be absolute. |
-m | Method to be used for testing. The value 'param' is for rDiff.parametric, 'nonparam' for rDiff.nonparametric and 'poisson' for rDiff.poisson. |
-L | The read length of the reads |
-M | Minimal length of reads that should be used. Reads shorter than this will not be included in the analysis. |
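Because the -a and -b lists must be comma-separated without spaces, it can help to build them programmatically. The following is a minimal shell sketch and not part of rDiff: the helper names join_bams and rdiff_cmd are hypothetical, the paths are the ones assumed in the example above, and -M is used for the minimal read length as listed in the FAQ option table. The sketch only prints the command line; it does not run rDiff.

```shell
#!/bin/sh
# Hypothetical helper: join all arguments with "," and no spaces,
# as required by the -a and -b options.
join_bams() (
    IFS=,
    printf '%s\n' "$*"
)

# Hypothetical helper: print the rDiff.parametric command line of the example.
# $1: comma-separated list for sample 1, $2: comma-separated list for sample 2
rdiff_cmd() {
    printf './rdiff -o outdir -d bamdir -a %s -b %s -g genes.gff3 -m param -L 75 -M 30\n' "$1" "$2"
}

sample1=$(join_bams bam11.bam bam12.bam)
sample2=$(join_bams bam21.bam bam22.bam)
rdiff_cmd "$sample1" "$sample2"
```

Printing the command first makes it easy to check the lists before starting a long run.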
The output files can be found in outdir. The output files are described in the following table:
Filename | Description |
---|---|
P_values_rDiff_parametric.tab | This file contains the p-values of rDiff.parametric. The file is tab-delimited and has three columns. The first column contains the gene names, the second the p-values and the third the test status. |
Gene_expression.tab | This file contains the gene expression estimations for all the replicates. The file is tab-delimited. The first column contains the gene names and the other columns the read counts for each gene for all replicates. |
Alternative_region_counts.mat | This file contains the counts for the alternative regions. The format is the binary mat format. |
genes.mat | This file contains the gene structure. The format is the binary mat format. |
variance_function_1.mat | This file contains the saved variance function for sample "1". It is a locfit-structure saved in the binary mat format. |
variance_function_2.mat | This file contains the saved variance function for sample "2". It is a locfit-structure saved in the binary mat format. |
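The p-value file can be filtered with standard tools. The following is a minimal sketch, assuming the three tab-delimited columns described above (gene name, p-value, test status) and the status string "OK" mentioned in the FAQ. The helper name significant_genes and the cutoff are illustrative assumptions, not part of rDiff. Lines whose second column is not numeric, such as a possible header, are skipped.

```shell
#!/bin/sh
# Hypothetical helper: print genes with test status "OK" and a p-value below a cutoff.
# $1: p-value file (gene <TAB> p-value <TAB> status), $2: cutoff
significant_genes() {
    awk -F '\t' -v cutoff="$2" '
        # Keep only numeric p-values with status OK that are below the cutoff.
        $2 ~ /^[0-9.eE+-]+$/ && $3 == "OK" && ($2 + 0) < (cutoff + 0) {
            print $1 "\t" $2
        }
    ' "$1"
}

# Example call (assumes the output directory from the example above):
# significant_genes outdir/P_values_rDiff_parametric.tab 0.05
```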
Using rDiff.nonparametric
When the gene structure is incomplete, we recommend using rDiff.nonparametric. This test looks for significant differences in read coverage. rDiff.nonparametric requires as input the bam files for each sample as well as a GFF3 gene structure. rDiff.nonparametric tries to estimate the biological variance on the annotated gene structure. Therefore, it is advantageous but not necessary to have as complete a gene structure as possible. Apart from the variance estimation, rDiff.nonparametric uses only the gene starts and gene stops for testing.
In the following example we test for differences between the two samples "1" and "2", which have the replicates bam11.bam, bam12.bam and bam21.bam, bam22.bam, respectively. In our example we assume that the bam files are located in the directory bamdir and that the reads are 75 bp long. Furthermore, we assume that our gene structure is saved in the file genes.gff3 in GFF3 format.
The test can then be started by first changing into the directory bin:
cd bin
and then typing:
./rdiff -o outdir -d bamdir -a bam11.bam,bam12.bam -b bam21.bam,bam22.bam -g genes.gff3 -m nonparam -L 75 -M 30
Here we additionally require that a read be at least 30 bp long in order to be included in the analysis. A detailed description of the parameters used can be found in the following table:
Option | Description |
---|---|
-o | The output directory for the results |
-d | The directory where the bam files are |
-a | The filenames of the bam files for the first sample. The filenames must be separated by "," and without spaces. |
-b | The filenames of the bam files for the second sample. The filenames must be separated by "," and without spaces. |
-g | The filename of the gene structure. The filename must be absolute. |
-m | Method to be used for testing. The value 'param' is for rDiff.parametric, 'nonparam' for rDiff.nonparametric and 'poisson' for rDiff.poisson. |
-L | The read length of the reads |
-M | Minimal length of reads that should be used. Reads shorter than this will not be included in the analysis. |
The output files can be found in outdir. The output files are described in the following table:
Filename | Description |
---|---|
P_values_rDiff_nonparametric.tab | This file contains the p-values of rDiff.nonparametric. The file is tab-delimited and has three columns. The first column contains the gene names, the second the p-values and the third the test status. |
Gene_expression.tab | This file contains the gene expression estimations for all the replicates. The file is tab-delimited. The first column contains the gene names and the other columns the read counts for each gene for all replicates. |
Nonparametric_region_counts.mat | This file contains the counts for the alternative regions used to estimate the variance functions. The format is the binary mat format. |
genes.mat | This file contains the gene structure. The format is the binary mat format. |
variance_function_1.mat | This file contains the saved variance function for sample "1". It is a locfit-structure saved in the binary mat format. |
variance_function_2.mat | This file contains the saved variance function for sample "2". It is a locfit-structure saved in the binary mat format. |
Working without replicates
When only one replicate is available for each sample, the replicates from both samples can be merged for the variance function estimation. This is done by passing the option -x (with the value 1, see the FAQ) in addition to the other options.
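Whether to merge the samples can be decided from the replicate lists themselves. The following is a minimal sketch, not part of rDiff, that prints -x 1 only when both comma-separated lists contain exactly one file. The helper names count_bams and merge_flag are hypothetical, and the value 1 for -x is taken from the FAQ option table.

```shell
#!/bin/sh
# Hypothetical helper: count the files in a comma-separated list.
count_bams() {
    printf '%s\n' "$1" | awk -F ',' '{ print NF }'
}

# Hypothetical helper: print "-x 1" when both samples have exactly one
# bam file, and print nothing otherwise.
# $1: list for sample 1, $2: list for sample 2
merge_flag() {
    if [ "$(count_bams "$1")" -eq 1 ] && [ "$(count_bams "$2")" -eq 1 ]; then
        printf '%s\n' '-x 1'
    fi
}

# Example: print the resulting command line instead of running it.
echo ./rdiff -o outdir -d bamdir -a bam11.bam -b bam21.bam -g genes.gff3 -m param $(merge_flag bam11.bam bam21.bam)
```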
FAQ
- What are all the rDiff options doing?
- What does it mean if the test status is not "OK"?
- How can I make rDiff.nonparametric faster?
What are all the rDiff options doing?
The different options for rDiff are explained in the table below. Most of them are not required for the basic features but can be used to adapt rDiff to your experimental setting.
Option | Description |
---|---|
-h | Display the help. |
-o | The output directory where the results should be written. This is also where rDiff saves its other output files. |
-d | Directory where the bam files are located. If they are in different directories, this can be left empty and the path to the bam files can be given as part of the bam file names. |
-a | The bam files to be used for sample 1, given as a comma-separated list. It is important not to have spaces between the files. The input should be of the form: File1.bam,File2.bam,... |
-b | The bam files to be used for sample 2, given as a comma-separated list. It is important not to have spaces between the files. The input should be of the form: File1.bam,File2.bam,... |
-g | Path to the GFF3 gene structure. |
-L | Read length used by rDiff.parametric to compute the alternative regions. The default is 75 bp. If the reads are longer or shorter, rDiff will try to find the best match to an alternative region. |
-m | The method that should be used for testing: param for rDiff.parametric, nonparam for rDiff.nonparametric, poisson for rDiff.poisson, mmd for rDiff.mmd. The default is rDiff.parametric. |
-M | Minimal read length required. The default is 30 bp. Reads that are shorter are not used for the analysis. |
-e | Skip the gene expression estimation. If the gene expression estimation step should be skipped, enter 0. The default is 1. |
-E | Only perform the gene expression and variance function estimation and do not perform testing. If you want to exit after the variance function estimation, enter 0. The default is 1. |
-A | Path to the variance function for sample 1. This option can be used, for example, if a previously computed variance function should be used. |
-B | Path to the variance function for sample 2. This option can be used, for example, if a previously computed variance function should be used. |
-S | Filename under which the variance function for sample 1 will be saved. |
-T | Filename under which the variance function for sample 2 will be saved. |
-P | Specify a parametric variance function for sample 1 of the form f(x)=a+b*x+c*x^2. The argument for this option is a,b,c. |
-Q | Specify a parametric variance function for sample 2 of the form f(x)=a+b*x+c*x^2. The argument for this option is a,b,c. |
-y | Use only the gene start and stop for the rDiff.nonparametric variance function estimation. Enter 1 if this should be done and 0 otherwise. |
-s | Sample the reads down to a certain number. This increases the speed for highly covered genes. The argument is the number of reads per gene to which to downsample. The default is 10000. |
-C | Number of bases to clip from each end of each read. This reduces the false mappings of spliced read ends. The default is 3 bp. |
-p | Number of permutations performed for rDiff.nonparametric. The default is 1000. |
-x | Merge sample 1 and sample 2 for the variance function estimation. Enter 1 to merge the samples. The default is 0. |
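To get a feeling for the parametric variance function accepted by -P and -Q, it can be evaluated by hand. The following is a minimal sketch, assuming the three comma-separated values a,b,c are the coefficients of the quadratic f(x)=a+b*x+c*x^2. The helper name variance_at and the example coefficients are purely illustrative and are not rDiff defaults.

```shell
#!/bin/sh
# Hypothetical helper: evaluate f(x) = a + b*x + c*x^2.
# $1: coefficients in the "a,b,c" form used by -P and -Q, $2: value of x
variance_at() {
    awk -v coef="$1" -v x="$2" 'BEGIN {
        split(coef, k, ",")
        print k[1] + k[2] * x + k[3] * x * x
    }'
}

# Example with made-up coefficients a=1, b=2, c=3 at x=2:
# 1 + 2*2 + 3*2*2 = 17
variance_at 1,2,3 2
```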
What does it mean if the test status is not "OK"?
This means that a problem occurred during testing. This can happen, for example, when there are not enough reads for testing.
How can I make rDiff.nonparametric faster?
You can either reduce the number of reads that are sampled using the option -s or reduce the number of permutations using the option -p. Alternatively, you can parallelize rDiff.nonparametric by first estimating the variance functions. You can then split up the gene structure and test each part using the estimated gene expression and variance functions.
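One simple way to split up the gene structure is by sequence ID, since all features of a gene lie on the same sequence. The following is a minimal sketch under that assumption and is not part of rDiff. The helper name split_gff3_by_seqid is hypothetical, and sequence IDs are assumed to be usable as file names. Each resulting file could then be given to a separate rDiff run via -g, with -A and -B pointing to the previously saved variance functions; check the rDiff help (-h) for the exact combination of options.

```shell
#!/bin/sh
# Hypothetical helper: write one GFF3 file per sequence ID (column 1).
# $1: input GFF3 file, $2: output directory (should not contain old chunk files)
split_gff3_by_seqid() {
    mkdir -p "$2" || return 1
    awk -F '\t' -v dir="$2" '
        # Skip comments, directives and lines without the nine GFF3 columns.
        /^#/ || NF < 9 { next }
        {
            out = dir "/" $1 ".gff3"
            print >> out
            close(out)   # keep the number of open files small
        }
    ' "$1"
}

# Example call (file names as assumed in the examples above):
# split_gff3_by_seqid genes.gff3 gff_chunks
```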