SeqFold: Genome-scale reconstruction of RNA secondary structures integrating experimental measurements

Prerequisites

1. Prepare data files from one of the following experiments.

PARS:
Two tab-delimited files of read counts, one for RNase S1 and one for RNase V1. One transcript per row. Format: transcript name (column 1), counts (column 2, semicolon separated, the raw number of reads obtained for each base). Example: S1 and V1
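Before running pars2spp.py, it can help to check that the S1 and V1 files describe the same transcripts with the same number of bases. The following is a minimal sketch, not part of SeqFold: the function names and the transcript names tx1 and tx2 are hypothetical, and it assumes that counts are numeric and that a trailing semicolon, if present, can be ignored.

```python
def parse_counts(lines):
    """Map transcript name -> list of per-base read counts."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        name, values = line.split("\t")[:2]
        # Assumption: empty fields (e.g. from a trailing ';') carry no data.
        counts[name] = [float(v) for v in values.split(";") if v != ""]
    return counts


def check_pairing(s1, v1):
    """Return transcripts missing from one file or with mismatched lengths."""
    problems = []
    for name in sorted(set(s1) | set(v1)):
        if name not in s1 or name not in v1 or len(s1[name]) != len(v1[name]):
            problems.append(name)
    return problems


# Hypothetical in-memory rows; with real data, pass open file handles instead.
s1 = parse_counts(["tx1\t0;3;1;0\n", "tx2\t5;2\n"])
v1 = parse_counts(["tx1\t2;0;0;4\n", "tx2\t1;1;7\n"])
print(check_pairing(s1, v1))
```

Any transcript reported by check_pairing should be inspected before generating structure preference profiles.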

SHAPE(-Seq):
A space-delimited file of SHAPE reactivities for each base of a transcript. One base per row. Format: base number (column 1), SHAPE reactivity (column 2). Example
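To illustrate the SHAPE layout described above, here is a small sketch that reads such rows into a dictionary. The function name parse_shape and the sample values are hypothetical, and the sketch assumes every reactivity is a plain number; how missing data are encoded in real files is not covered here.

```python
def parse_shape(lines):
    """Return {base_number: reactivity} from 'base_number reactivity' rows."""
    reactivities = {}
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if len(fields) < 2:
            continue  # skip blank or incomplete rows
        reactivities[int(fields[0])] = float(fields[1])
    return reactivities


# Hypothetical rows; with real data, pass an open file handle instead.
shape = parse_shape(["1 0.12\n", "2 1.85\n", "3 0.00\n"])
print(len(shape), "bases read")
```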

FragSeq:
A folder with cutting score files (*.cutscores.ss.list) output by FragSeq_v0.0.1. Example

2. Prepare a sequence file in FASTA format. Example

3. Download and install Sfold following its description.

Run SeqFold

1. Generate structure preference profiles

PARS:
python pars2spp.py s1_file v1_file outfile_prefix

SHAPE(-Seq):
python shape2spp.py shape_file outfile_prefix

FragSeq:
python fragseq2spp.py path_to_cuttingscore_folder outfile_prefix

Each command outputs a file of structure preference profiles. One transcript per row. Format: transcript name (column 1), structure preferences (column 2, semicolon separated, the preference for single-strandedness of each base).

2. Generate sample structures and clusters for each transcript

perl sfold_wrapper.pl sfold_executable_file input_fasta_file sfold_output_directory

Note: In a parallel computing environment, you can reduce the runtime by modifying the value of $para in sfold_wrapper.pl. Example: $para = "bsub -M 3072000 -W 6:00";

3. Generate RNA secondary structure predictions and base-level accessibilities

python seqfold.py sfold_output_directory structure_preference_profile

Optional parameters
-d    Output directory. Default: ./
-o    Prefix of output summary files. Default: out
-f    Cutoff fraction (cutoff_frac). Sequences in which the fraction of sites having experimental data is <= cutoff_frac are filtered out. Default: 0
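When scripting step 3, the command line can be assembled programmatically. The sketch below uses only the options listed above; the helper names, the directory name sfold_out, and the file name profile.txt are hypothetical. Placing the options before the two positional arguments is an assumption, not something this README specifies.

```python
import subprocess


def build_seqfold_command(sfold_dir, profile, out_dir=None, prefix=None,
                          cutoff_frac=None):
    """Assemble the argument list for seqfold.py; options are added only when set."""
    cmd = ["python", "seqfold.py"]
    if out_dir is not None:
        cmd += ["-d", out_dir]
    if prefix is not None:
        cmd += ["-o", prefix]
    if cutoff_frac is not None:
        cmd += ["-f", str(cutoff_frac)]
    # Assumption: options are placed before the positional arguments.
    cmd += [sfold_dir, profile]
    return cmd


def run_seqfold(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)


# Hypothetical paths; nothing is executed until run_seqfold(cmd) is called.
cmd = build_seqfold_command("sfold_out", "profile.txt",
                            out_dir="results", prefix="run1", cutoff_frac=0.5)
print(" ".join(cmd))
```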

Interpret the results

*.seqfold.ct
The predicted secondary structure of each transcript in CT format. One transcript per file.
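For a quick look at a predicted structure, a CT file can be converted to dot-bracket notation. This is a rough sketch that assumes the conventional CT layout (a header line starting with the number of bases, then one row per base with the pairing partner in column 5, 0 meaning unpaired) and a structure without pseudoknots. The function name and the 9-base hairpin are hypothetical.

```python
def ct_to_dot_bracket(lines):
    """Return (sequence, dot-bracket string) from the lines of one CT file."""
    rows = [ln.split() for ln in lines if ln.strip()]
    n = int(rows[0][0])  # header: number of bases, then a free-text title
    sequence, structure = [], []
    for fields in rows[1:n + 1]:
        index, base, partner = int(fields[0]), fields[1], int(fields[4])
        sequence.append(base)
        if partner == 0:
            structure.append(".")   # unpaired base
        elif partner > index:
            structure.append("(")   # pairs with a downstream base
        else:
            structure.append(")")   # pairs with an upstream base
    return "".join(sequence), "".join(structure)


# Hypothetical hairpin: bases 1-3 pair with bases 9-7.
example_ct = [
    "9 example",
    "1 G 0 2 9 1",
    "2 G 1 3 8 2",
    "3 G 2 4 7 3",
    "4 A 3 5 0 4",
    "5 A 4 6 0 5",
    "6 A 5 7 0 6",
    "7 C 6 8 3 7",
    "8 C 7 9 2 8",
    "9 C 8 0 1 9",
]
seq, db = ct_to_dot_bracket(example_ct)
print(seq)
print(db)
```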

out.acc
The accessibility of each base of each transcript. One transcript per row. Format: transcript name (column 1), accessibilities (column 2, semicolon separated, the accessibility of each base).
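As a starting point for downstream analysis, out.acc can be read into per-transcript lists and summarized. The sketch below assumes the two columns are tab-separated and that every accessibility value is numeric; neither is stated explicitly above. The function names and the transcript tx1 are hypothetical.

```python
def parse_accessibility(lines):
    """Map transcript name -> list of per-base accessibilities."""
    profiles = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Assumption: columns are separated by a tab.
        name, values = line.split("\t")[:2]
        profiles[name] = [float(v) for v in values.split(";") if v != ""]
    return profiles


def mean_accessibility(profiles):
    """Return the average accessibility of each non-empty transcript."""
    return {name: sum(vals) / len(vals) for name, vals in profiles.items() if vals}


# Hypothetical row; with real data, pass an open handle on out.acc instead.
acc = parse_accessibility(["tx1\t0.25;0.75;0.5\n"])
means = mean_accessibility(acc)
print(means)
```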