HSA: integrating multi-track Hi-C data for genome-scale reconstruction of 3D chromatin structure

Prerequisites

Main function for multi-track 3D fitting is written and tested under R 3.1.1. Make sure you have installed package ‘MASS’. If you want a real-time visualization, install the package rgl and uncomment the line 885 and 886 in R script cstruct1.R.

Required inputs include contact maps’ files, covariate files (if there is any bias-related covariates), output file and Indicator of Markov distance control.

Format of contact map file:
Each
contact map file contains one contact map. Suppose there are n loci, contact map file should contain a matrix with n rows and n+2 column, in which:
column 1:  start position of each locus
column 2:  end position of each locus
column 3 to n+2: the n*n contact map in matrix form with all entries nonnegative.

Format of covariate file:
Each file contains the covariate's for one
contactmap
Number of rows should be the same as its corresponding contact map file. The first two columns are the same as those in contact map file. Columns after 2 are bias related features in each loci such as effective length, GC content and mappability score.
If the input contact map is already normalized, covariate file is not needed.

Output file: file name plus directory of where output will be saved. PLEASE NOTE THAT THIS TERM CAN NOT BE OMITTED!!!

Indicator of Markov distance control: 1 for using Markov distance control and 0 not using. We recommend it be set to 1 if the percentage of nonzero entries in contact map is lower than 10%.

Opitional input:


Inital 3D structure coordinates:
Initial 3D structure if the user has one he want to input, null by default. This input is recommended in high resolution 3D structure reconstruction.

Run HSA

Command format:

-Runing one multi-track fitted structure in R console:

    Open R, after importing related contact maps, covariate files, and output file's directory and name, type:
    >source("yourpath/hsacode/cstruct1.R")
    >
fmain(lsmap0,lscov0,outputfile,Maxiter,submaxiter,lamda,Leapfrog,epslon,mkfix,rho,mk,initialS)
   

   input annotation:
   lsmap0: list(map1,…,mapK). A list containing contact maps to be modelled,
   Suppose there are n loci in a contact map1, then map format1 is a matrix with n rows and n+2 column, in which:
   column 1:  start position of each locus
   column 2:  end position of each locus
   column 3 to n+2: the n*n contact map in matrix form with all entries nonnegative.

   lscov0:list(listcov1,…,listcovK) or 0 if no bias-correction covariate is need.
   list(listcov1,…,listcovK), a list containing bias covariates for each contact map.
   e.g. Suppose there are 3 covariates, listcovk is like list(covmatk_1,..., covmat1_K),
   covmatk_1~covmatk_3 are n*n matrices in which n is the number of rows in mapk.
 
   output: output filename, this term can NOT be omitted.

   Maxiter(50~1000), submaxiter(100~150), lamda (
submaxiter/4~submaxiter/3), Leapfrog(20~50), epslon(0.0003~0.02): tuning parameters for maximum iteration numbers, HMC sampled length, Leapfrog steplength/stepsize, etc, can be set by users. The range in each bracket is the value range we use empirically for reference. The larger the Maxiter, submaxiter, and Leapfrog's values are, the longer the time it takes to run. The larger the epslon is, the more radical the sampling process will be, and the less smooth the structure is updated. You might need to try a few epslon values to see which one fits your data. Do NOT use too large epslon. You can also use real-time visualization as mentioned above to tune these parameters and decide when to terminate.
  
  
mkfix: 0 if use default markov coeffients and 1 if estimating from data. Default: 0
  
   rho: ranges in [0,1) for tunning the kinetic and momentum in Hamiltion dynamic. Default: 0
  
   mk: use Markov distance control (1) or not (0).We recommend it be set to 1 if the percentage of nonzero entries in contact map is low than 10%.

   initialS: Initial 3D structure if the user has one he or she wants to input. Null by default. If the input structure has the same number of loci as contact maps, this initalS can be just a 3-column coordinates. If the loci in this structure are not the same as contact map, this variable should have five columns, with the first two colunms specifying the start and end locations of each locus and the last three as 3D coordinates.
  
   Other options:
  
   coarsefit: use a larger stepsize in Leapfrog at the beginning, Default=True. Note that large stepsize is more likely to generate outliers and NA values that cause error in computation. If that happens, try coarsefit=F.
  
   rmoutlier: remove outliers in the output structure (T) or not (F). Default=F.
  
   fitmode: 0 (Defualt) for a radical fitting pattern that achieves higher loglikelihood faster but is more prone to outliers and computation overflow, and 1 for a gentler yet also slower mode. 

-
Runing one multi-track fitted structure by Rscript command:
    
     After setting up the paramters to what you want in the fmain command in the myR.R file,

     if input contain covariates:

     Rscript myR.R contactmap_file1 …  contactmap_fileK covariate_file1 …  covariate_fileK outputfile Markov_Indicator Initial_structure_file(if available)

     if no input for covariates:
     Rscript myR.R contactmap_file1 … ,contactmap_fileK 0 outputfile Markov_Indicator Initial_structure_file(if available)

For contact maps at highe resolution (e.g., higher than 200kb or more than 1000 loci), we recommend using the reconstructed 3D structure from its lower resolution counterpart as the initial structure. If its lower resolution contact map is not available, a quick substitute is a sub contact map derived from extracting equally spaced columns and rows from the high resolution map.

Interpret the results

Output includes:
outputfile.txt:
column 1:  start position of each locus
column 2:  end position of each locus
column 3 to 5: 3D coordinates of structure with maximum likelihood ever occurred in iteration. This optimal structure is stored and updated at each iteration thus can be checked any time during the fitting process. If you think the output is already good enough at some point, you can terminate the algorithm and do not have to wait untill the end of fitting process.

outputbeta.txt:
Estimated coefficients in GLM. sorted as intercept_1, coefficient of bias covariates_1, coefficient of spatial distance_1, … , intercept_K, coefficient of bias covariates_K, coefficient of spatial distance_K, in one vector.

outputtemp.txt:
column 1:  start position of each locus
column 2:  end position of each locus
column 3 to 5: 3D coordinates of structure sampled at latest iteration.