SeqFold: Genome-scale reconstruction of RNA secondary structures integrating experimental measurements


1. Prepare data files from one of the following experiments.

A tab-delimited file of read counts for RNase S1, and another for V1. One transcript per row. Format: transcript name (column 1), counts (column 2, semicolon separated, the raw number of reads obtained for each base). Example S1 and V1

A space-delimited file of SHAPE reactivities for each base of a transcript. One base per row. Format: base number (column 1), SHAPE reactivity (column 2). Example

A folder with cutting score files (* output by FragSeq_v0.0.1. Example
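The two text formats described above can be checked before running SeqFold. The following is a minimal sketch, not part of SeqFold itself: the function names and the transcript name tx1 are hypothetical, and it assumes raw counts are integers and that the SHAPE file has no header row.

```python
def parse_count_line(line):
    """Split one 'name<TAB>c1;c2;...' row into (name, [counts]).

    Assumes raw read counts are integers, one per base.
    """
    name, counts = line.rstrip("\n").split("\t")
    return name, [int(c) for c in counts.split(";") if c != ""]


def parse_shape_lines(lines):
    """Read 'base_number reactivity' rows into a dict keyed by base number."""
    reactivities = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete rows
        reactivities[int(fields[0])] = float(fields[1])
    return reactivities


# Hypothetical example rows, made up for illustration only.
s1_name, s1_counts = parse_count_line("tx1\t0;3;12;7\n")
shape = parse_shape_lines(["1 0.12", "2 1.50", "3 0.00"])
print(s1_name, s1_counts)
print(shape)
```

For real files, the same functions can be applied to each line read from the S1, V1, or SHAPE file.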

2. Prepare a sequence file in FASTA format. Example

3. Download and install Sfold following its description.

Run SeqFold

1. Generate structure preference profiles

python s1_file v1_file outfile_prefix

python shape_file outfile_prefix

python path_to_cuttingscore_folder outfile_prefix

Each command outputs a file of structure preference profiles, one transcript per row. Format: transcript name (column 1), structure preferences (column 2, semicolon separated, the preference for single-strandedness of each base).
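Because each profile holds one value per base, a quick sanity check is to compare profile lengths with the sequence lengths in the FASTA file. The sketch below is an illustration only and is not shipped with SeqFold. It assumes that the transcript name is the first whitespace-delimited token of the FASTA header, and that column 2 has exactly one entry per base with no trailing semicolon; all function names are hypothetical.

```python
def read_fasta_lengths(lines):
    """Return {transcript name: sequence length} from FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # assumed: name is the first token
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def read_profile_lengths(lines):
    """Return {transcript name: number of per-base entries} from a profile file."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue
        name, values = line.rstrip("\n").split("\t")
        counts[name] = len(values.split(";"))
    return counts


def length_mismatches(fasta_lengths, profile_lengths):
    """List transcripts whose profile length differs from the sequence length."""
    return sorted(
        name for name, n in profile_lengths.items()
        if fasta_lengths.get(name) != n
    )


# Hypothetical data: tx2 has 3 bases but only 2 profile entries.
fasta = [">tx1 example", "ACGU", "AC", ">tx2", "GGG"]
profiles = ["tx1\t0.1;0.9;0.5;0.5;0.2;0.3", "tx2\t0.4;0.6"]
bad = length_mismatches(read_fasta_lengths(fasta), read_profile_lengths(profiles))
print(bad)
```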

2. Generate sample structures and clusters for each transcript

perl sfold_executable_file input_fasta_file sfold_output_directory

Note: In a parallel computing environment, one can reduce the runtime by modifying the value of $para in the script above. Example: $para = "bsub -M 3072000 -W 6:00";

3. Generate RNA secondary structure predictions and base-level accessibilities

python sfold_output_directory structure_preference_profile

Optional parameters
-d    Output directory. Default: ./
-o    Prefix of output summary files. Default: out
-f    Cutoff fraction (cutoff_frac). Sequences in which <= cutoff_frac of the sites have experimental data are filtered out. Default: 0

Interpret the results

The predicted secondary structure of each transcript in CT format. One transcript per file.
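A CT file can be converted to dot-bracket notation for quick inspection. The sketch below assumes the common CT layout: a header line whose first field is the sequence length, followed by one row per base with the base index in column 1, the nucleotide in column 2, and the index of the pairing partner (0 if unpaired) in column 5. Whether SeqFold's files follow this layout exactly should be checked against an actual output file. The function name is hypothetical, and pseudoknots are not represented correctly by this simple conversion.

```python
def ct_to_dot_bracket(lines):
    """Convert CT-format lines to (sequence, dot-bracket structure)."""
    rows = [line.split() for line in lines if line.strip()]
    n = int(rows[0][0])  # header: first field is the sequence length
    sequence = []
    structure = []
    for fields in rows[1:n + 1]:
        index, base, partner = int(fields[0]), fields[1], int(fields[4])
        sequence.append(base)
        if partner == 0:
            structure.append(".")   # unpaired base
        elif partner > index:
            structure.append("(")   # pairs with a downstream base
        else:
            structure.append(")")   # pairs with an upstream base
    return "".join(sequence), "".join(structure)


# Hypothetical 8-nt hairpin, made up for illustration only.
ct_lines = [
    "8 example",
    "1 G 0 2 8 1",
    "2 C 1 3 7 2",
    "3 A 2 4 0 3",
    "4 A 3 5 0 4",
    "5 A 4 6 0 5",
    "6 A 5 7 0 6",
    "7 G 6 8 2 7",
    "8 C 7 0 1 8",
]
seq, db = ct_to_dot_bracket(ct_lines)
print(seq)
print(db)
```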

The accessibility of each base of each transcript. One transcript per row. Format: transcript name (column 1), accessibilities (column 2, semicolon separated, the accessibility of each base)
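One possible downstream use of the accessibility file is to locate the most accessible stretch of a transcript. The sliding-window analysis below is an illustration and not something SeqFold performs. It assumes that accessibilities are numeric and that higher values mean more accessible; the function names and the example row are hypothetical.

```python
def parse_accessibility_line(line):
    """Split one 'name<TAB>a1;a2;...' row into (name, [accessibilities])."""
    name, values = line.rstrip("\n").split("\t")
    return name, [float(v) for v in values.split(";")]


def most_accessible_window(values, width):
    """Return (1-based start, mean) of the window with the highest mean."""
    if width <= 0 or width > len(values):
        raise ValueError("width must be between 1 and the number of bases")
    window_sum = sum(values[:width])
    best_sum, best_start = window_sum, 0
    for i in range(width, len(values)):
        # Slide the window one base to the right.
        window_sum += values[i] - values[i - width]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, i - width + 1
    return best_start + 1, best_sum / width


# Hypothetical row, made up for illustration only.
name, acc = parse_accessibility_line("tx1\t0.1;0.2;0.9;0.8;0.7;0.1\n")
start, mean_acc = most_accessible_window(acc, 3)
print(name, start, round(mean_acc, 3))
```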