This repository is the working directory for the Garnet-Forest bundle of Python scripts for analyzing diverse forms of 'omic' data in a network context.

fraenkel-lab Last update: Dec 02, 2021

OmicsIntegrator has moved. See OmicsIntegrator2. This codebase is not maintained.

Omics Integrator is a package designed to integrate proteomic data, gene expression data and/or epigenetic data using a protein-protein interaction network. It comprises two modules, Garnet and Forest.

Contact: Amanda Kedaigle [[email protected]]

Reference:

Network-Based Interpretation of Diverse High-Throughput Datasets through the Omics Integrator Software Package. Tuncbag N*, Gosline SJC*, Kedaigle A, Soltis AR, Gitter A, Fraenkel E. PLoS Comput Biol 12(4): e1004879. doi:10.1371/journal.pcbi.1004879.

For a step-by-step protocol for running this software: Discovering altered regulation and signaling through network-based integration of transcriptomic, epigenomic and proteomic tumor data. Kedaigle A, and Fraenkel E. Cancer Systems Biology: Methods in Molecular Biology, 2018.

System Requirements:

Python 2.6 or 2.7 (3.x version currently untested) and the dependencies below. We recommend that users without an existing Python environment install Anaconda (https://www.continuum.io/downloads) to obtain Python 2.7 and the following required packages:

msgsteiner package (version 1.3): code, license
Boost C++ library: http://www.boost.org
Cytoscape for viewing results graphically (tested on versions 2.8-3.2): http://www.cytoscape.org

Features

Maps gene expression data to transcription factors using chromatin accessibility data
Identifies proteins in the same pathway as hits using a protein interaction network
Integrates numerous high-throughput data types to determine testable biological hypotheses

Installation:

Omics Integrator is a collection of Python scripts and data files, so it can be easily installed on any system. Steps 1 through 4 are only required for Forest, and you may skip to step 5 if you will only be running Garnet.

1. Boost is pre-installed on many Linux distributions. If your operating system does not include Boost, follow the Boost getting started guide for instructions on how to download the library and extract files from the archive. To use the Homebrew package manager for Mac, simply type brew install boost to install the library.
2. Download msgsteiner-1.3.tgz from http://staff.polito.it/alfredo.braunstein/code/msgsteiner-1.3.tgz (license)
3. Unpack files from the archive: tar -xvf msgsteiner-1.3.tgz
4. Enter the msgsteiner-1.3 subdirectory and run make

   See this advice on compiling the C++ code if you encounter problems and this advice regarding compilation issues on OS X.
   Make a note of the path to the compiled msgsteiner file that was created, which you will use when running Forest.
   In Linux, use readlink -f msgsteiner in the msgsteiner-1.3 subdirectory to obtain the path.

5. Download the Omics Integrator package: OmicsIntegrator-0.3.1.tar.gz
6. Unpack files from the archive: tar -xvzf OmicsIntegrator-0.3.1.tar.gz
7. Make sure you have all the requirements using the pip tool by entering the directory and typing: pip install -r requirements.txt

Some users have reported errors when using this command to install matplotlib. To fix, install matplotlib independently (http://matplotlib.org) or use Anaconda as indicated above.

Now Omics Integrator is installed on your computer and can be used to analyze your data.
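Before running Forest it can help to confirm that the path noted for the compiled msgsteiner file really points at an executable. The following is a minimal sketch, not part of Omics Integrator: the helper name resolve_executable is hypothetical, and os.path.realpath is used as a portable stand-in for readlink -f.

```python
import os


def resolve_executable(path):
    """Return the absolute, symlink-free path to an executable file.

    Raises ValueError if the path is not a regular file or is not executable.
    """
    full_path = os.path.realpath(path)  # comparable to `readlink -f`
    if not os.path.isfile(full_path):
        raise ValueError("not a file: %s" % full_path)
    if not os.access(full_path, os.X_OK):
        raise ValueError("not executable: %s" % full_path)
    return full_path
```

The returned string is the kind of value that would be passed to the Forest --msgpath option.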

Examples

We provide many scripts and files to showcase the various capabilities of OmicsIntegrator. To run them:

Download the example files
Unpack by typing tar -xvzf OmicsIntegratorExamples.tar.gz in the dist directory.
Move the unpacked files into the example directory.

For specific details about the examples, check out the README file in the example directory.

Running garnet.py

Garnet is a script that runs a series of smaller scripts to map epigenetic data to genes and then scan the genome to determine the likelihood of a transcription factor binding the genome near that gene.

Usage: garnet.py [configfilename]

  -s SEED, --seed=SEED  An integer seed for the pseudo-random number
                        generators. If you want to reproduce exact results,
                        supply the same seed. Default = None.

Options:
  -h, --help            show this help message and exit
  --outdir=OUTDIR       Name of directory to place garnet output. DEFAULT:none
  --utilpath=ADDPATH    Destination of chipsequtil library, Default=../src

Unlike Forest, the Garnet configuration file is a positional argument and must not be preceded with --conf=. The configuration file should take the following format:

garnet input

[chromatinData]
#these files contain epigenetically interesting regions
bedfile = bedfilecontainingregions.bed
fastafile = fastafilemappedusinggalaxytools.fasta
#these two files are provided in the package
genefile = ../../data/ucsc_hg19_knownGenes.txt
xreffile = ../../data/ucsc_hg19_kgXref.txt
#distance to look from transcription start site
windowsize = 2000

[motifData]
#motif matrices to be used, data provided with the package
tamo_file = ../../data/matrix_files/vertebrates_clustered_motifs.tamo
#settings for scanning
genome = hg19
numthreads = 4
doNetwork = False
tfDelimiter = .

[expressionData]
expressionFile = tabDelimitedExpressionData.txt
pvalThresh = 0.01
qvalThresh =

[regression]
#for generating and saving regression plots
savePlot=False
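To check a configuration file before handing it to Garnet, the sections can be read with the Python standard library. This is only a sketch: it uses the Python 3 configparser module rather than Garnet's own parsing code, the function name load_garnet_config is hypothetical, and the embedded text is an abbreviated copy of the configuration above.

```python
import configparser  # named ConfigParser in Python 2

# Abbreviated copy of the Garnet configuration format.
EXAMPLE_CONFIG = """\
[chromatinData]
#these files contain epigenetically interesting regions
bedfile = bedfilecontainingregions.bed
fastafile = fastafilemappedusinggalaxytools.fasta
windowsize = 2000

[motifData]
genome = hg19
numthreads = 4
doNetwork = False
tfDelimiter = .

[expressionData]
expressionFile = tabDelimitedExpressionData.txt
pvalThresh = 0.01
qvalThresh =

[regression]
savePlot=False
"""


def load_garnet_config(text):
    """Parse configuration text into {section: {option: raw string value}}."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names such as doNetwork case-sensitive
    parser.read_string(text)
    return {name: dict(parser.items(name)) for name in parser.sections()}


config = load_garnet_config(EXAMPLE_CONFIG)
window = int(config["chromatinData"]["windowsize"])      # distance in base pairs
save_plot = config["regression"]["savePlot"] == "True"   # values are plain strings
print(window, save_plot)
```

An option left blank, such as qvalThresh, is returned as an empty string.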

Chromatin Data

Many BED-formatted (bedfile) and FASTA-formatted (fastafile) files are included in the examples/ directory. bedfile can also be output from MACS (with a .xls extension) or GPS/GEM (with a .txt extension). To use your own epigenetic data, convert to BED and upload the BED-file to http://usegalaxy.org and select Fetch Alignments/Sequences from the left menu to click on Extract Genomic DNA. This will produce a FASTA-formatted file that will work with garnet. We have provided gene (genefile) and xref (xreffile) annotations for both hg19 and mm9 - these files can be downloaded from http://genome.ucsc.edu/cgi-bin/hgTables if needed. The windowsize parameter determines the maximum distance from a transcription start site to consider an epigenetic event associated. 2kb is a very conservative metric.

motifData

We provide motif data in the proper TAMO format; the user just needs to enter the genome used. The default numthreads is 4, but the user can alter this depending on the processing power of their machine. doNetwork will create a NetworkX object mapping transcription factors to genes, required input for the SAMNet algorithm. tfDelimiter is an internal parameter to tell Garnet how to handle cases when many transcription factors map to the same binding motif.

expressionData

If the user has expression data to evaluate, provide a tab-delimited file under expressionFile. The file should have two columns, one containing the name of the gene and the second containing the log fold change of that gene in a particular condition. We recommend only including those genes whose change in expression is statistically significant. P-value (pvalThresh) or Q-value (qvalThresh) thresholds will be used to select only those transcription factors whose correlation with expression falls below the provided threshold.
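A two-column expression file of this kind can be produced with a few lines of Python. The sketch below is illustrative only: the helper name, the gene names, and the values are made up, it assumes Python 3, and it assumes the file has no header row, which the description above does not state either way.

```python
import csv


def write_expression_file(path, log_fold_changes):
    """Write gene names and log fold changes as two tab-separated columns."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for gene, log_fc in sorted(log_fold_changes.items()):
            writer.writerow([gene, log_fc])


# Hypothetical genes and log fold changes, for illustration only.
write_expression_file("tabDelimitedExpressionData.txt",
                      {"GENE_A": 2.0, "GENE_B": -1.25})
```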

regression

Linear regression plots are placed in a subdirectory named regression_plots if savePlot=True in the configuration file.

Garnet output

Garnet produces a number of intermediate files that enable you to better interpret your data or re-run a sub-script that may have failed. All files are placed in the directory provided by the --outdir option of the garnet script.

1. events_to_genes.fsa: This file contains the regions of the fastafile provided in the configuration file that are within the specified distance to a transcription start site.
2. events_to_genes.xls: This file contains each region, the epigenetic activity in that region, and the relationship of that region to the closest gene.
3. events_to_genes_with_motifs.txt: This contains the raw transcription factor scoring data for each region in the fasta file.
4. events_to_genes_with_motifs.tgm: This contains the transcription factor binding matrix scoring data mapped to the closest gene.
5. events_to_genes_with_motifs_tfids.txt: Names of transcription factors (or columns) of the matrix.
6. events_to_genes_with_motifs_geneids.txt: Names of genes (or rows) of the matrix.
7. events_to_genes_with_motifs.pkl: A Pickle-compressed Python file containing a dictionary data structure that contains files 4-6 (under the keys tgm, tfs, and genes, respectively) as well as a delim key that describes what delimiter was used to separate out TFs in the case where there are multiple TFs in the same family.
8. events_to_genes_with_motifsregression_results.tsv: Results from linear regression.
9. events_to_genes_with_motifsregression_results_FOREST_INPUT.tsv: Only those regression results that fall under the p-value or q-value significance threshold provided in the configuration file, e.g. p=0.05, are included. This file can be used as input to Forest, and the prizes are -log2(pval) or -log2(qval).
10. regression_plots: An optional subdirectory that contains plots visualizing the transcription factor linear regression tests.
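The relationship between a significance value and the prize written to the FOREST_INPUT file can be sketched as follows. This illustrates the -log2 transform and the threshold filter only; it is not Garnet's code, the function and transcription factor names are hypothetical, and treating "under the threshold" as a strict inequality is an assumption.

```python
import math


def forest_prizes(tf_significance, threshold):
    """Map each transcription factor under the threshold to -log2(value)."""
    prizes = {}
    for tf_name, value in tf_significance.items():
        if not 0.0 < value <= 1.0:
            raise ValueError("significance must be in (0, 1]: %r" % value)
        if value < threshold:  # assumption: strictly below the threshold
            prizes[tf_name] = -math.log2(value)
    return prizes


# Hypothetical p-values: only TF_A passes a threshold of 0.3.
print(forest_prizes({"TF_A": 0.25, "TF_B": 0.5}, 0.3))
```

Smaller p-values or q-values therefore give larger prizes.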

Running forest.py

Forest requires the compiled msgsteiner package.

Usage: forest.py [options]

Find multiple pathways within an interactome that are altered in a particular
condition using the Prize Collecting Steiner Forest problem

Options:
  -h, --help            show this help message and exit
  -p PRIZEFILE, --prize=PRIZEFILE
                        (Required) Path to the text file containing the
                        prizes. Should be a tab delimited file with lines:
                        "ProteinName PrizeValue"
  -e EDGEFILE, --edge=EDGEFILE
                        (Required) Path to the text file containing the
                        interactome edges. Should be a tab delimited file with
                        3 or 4 columns: "ProteinA        ProteinB
                        Weight(between 0 and 1) Directionality(U or D,
                        optional)"
  -c CONFFILE, --conf=CONFFILE
                        Path to the text file containing the parameters.
                        Should be several lines that looks like:
                        "ParameterName = ParameterValue". Must contain values
                        for w, b, D. May contain values for optional
                        parameters mu, garnetBeta, noise, r, g. Default =
                        "./conf.txt"
  -d DUMMYMODE, --dummyMode=DUMMYMODE
                        Tells the program which nodes in the interactome to
                        connect the dummy node to. "terminals"= connect to all
                        terminals, "others"= connect to all nodes except for
                        terminals, "all"= connect to all nodes in the
                        interactome. If you wish you supply your own list of
                        proteins, dummyMode could also be the path to a text
                        file containing a list of proteins (one per line).
                        Default = "terminals"
  --garnet=GARNET       Path to the text file containing the output of the
                        GARNET module regression. Should be a tab delimited
                        file with 2 columns: "TranscriptionFactorName
                        Score". Default = "None"
  --musquared           Flag to add negative prizes to hub nodes proportional
                        to their degree^2, rather than degree. Must specify a
                        positive mu in conf file.
  --excludeTerms        Flag to exclude terminals when calculating negative
                        prizes. Use if you want terminals to keep exact
                        assigned prize regardless of degree.
  --msgpath=MSGPATH     Full path to the message passing code. Default =
                        "<current directory>/msgsteiner"
  --outpath=OUTPUTPATH  Path to the directory which will hold the output
                        files. Default = this directory
  --outlabel=OUTPUTLABEL
                        A string to put at the beginning of the names of files
                        output by the program. Default = "result"
  --cyto30              Use this flag if you want the output files to be
                        amenable with Cytoscape v3.0 (this is the default).
  --cyto28              Use this flag if you want the output files to be
                        amenable with Cytoscape v2.8, rather than v3.0.
  --noisyEdges=NOISENUM
                        An integer specifying how many times you would like to
                        add noise to the given edge values and re-run the
                        algorithm. Results of these runs will be merged
                        together and written in files with the word
                        "_noisyEdges_" added to their names. The noise level
                        can be controlled using the configuration file.
                        Default = 0
  --shuffledPrizes=SHUFFLENUM
                        An integer specifying how many times you would like to
                        shuffle around the given prizes and re-run the
                        algorithm. Results of these runs will be merged
                        together and written in files with the word
                        "_shuffledPrizes_" added to their names. Default = 0
  --randomTerminals=TERMNUM
                        An integer specifying how many times you would like to
                        apply your given prizes to random nodes in the
                        interactome (with a similar degree distribution) and
                        re-run the algorithm. Results of these runs will be
                        merged together and written in files with the word
                        "_randomTerminals_" added to their names. Default = 0
  --knockout=KNOCKOUT   A list specifying protein(s) you would like to "knock
                        out" of the interactome to simulate a knockout
                        experiment, i.e. ['TP53'] or ['TP53', 'EGFR'].
  -k CV, --cv=CV        An integer specifying the k value if you would like to
                        run k-fold cross validation on the prize proteins.
                        Default = None.
  --cv-reps=CV_REPS     An integer specifying how many runs of cross-
                        validation you would like to run. To use this option,
                        you must also specify a -k or --cv parameter. Default
                        = None.
  -s SEED, --seed=SEED  An integer seed for the pseudo-random number
                        generators. If you want to reproduce exact results,
                        supply the same seed. Default = None.

Forest input files and parameters

Required inputs

The first two options (-p and -e) are required. You should record your terminal nodes and prize values in a text file. The file example/a549/Tgfb_phos.txt is an example of what this file should look like. You should record your interactome and edge weights in a text file with 3 or 4 columns. The file data/iref_mitab_miscore_2013_08_12_interactome.txt is a human interactome example (this interactome comes from iRefIndex v13, scored and formatted for our code).
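The edge file layout described in the usage text (tab-delimited, 3 or 4 columns, weight between 0 and 1, optional directionality of U or D) can be checked line by line before a run. The helper below is a hypothetical sketch, not Forest's own reader; it returns None for a missing directionality column rather than guessing what Forest assumes in that case.

```python
def parse_edge_line(line):
    """Validate one interactome line and return (a, b, weight, direction)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (3, 4):
        raise ValueError("expected 3 or 4 tab-separated columns: %r" % line)
    protein_a, protein_b = fields[0], fields[1]
    weight = float(fields[2])
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1: %r" % line)
    direction = fields[3] if len(fields) == 4 else None
    if direction not in (None, "U", "D"):
        raise ValueError("directionality must be U or D: %r" % line)
    return protein_a, protein_b, weight, direction


# Hypothetical protein names, for illustration only.
print(parse_edge_line("ProteinA\tProteinB\t0.75\tD\n"))
```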

A sample configuration file, a549/tgfb_forest.cfg, is supplied. The user can change the values included in this file or can supply their own similarly formatted file. Unlike Garnet, the Forest configuration file name must be preceded with -c or --conf=. If the -c argument is not included in the command line, the program will attempt to read the default conf.txt. The parameters w, b, and D must be set in the configuration file. Optional parameters mu, garnetBeta, noise, g, and r may also be included. The processes and threads parameters both provide parallelization. By default, Forest parallelizes tasks by running each network optimization task (e.g. for a different set of shuffled prizes or edge noise values) in a different, single-threaded process. If you are not running Forest multiple times with cross validation, shuffled prizes, or noisy edges, you may set processes = 1 and threads to the number of processors on your computer to run msgsteiner in a multi-threaded manner.

w = float, controls the number of trees
b = float, controls the trade-off between including more
    terminals and using less reliable edges
D = int, controls the maximum path-length from v0 to terminal nodes
mu = float, controls the degree-based negative prizes (default 0.0)
garnetBeta = float, scales the garnet output prizes relative to the
             provided protein prizes (default 0.01)
noise = float, controls the standard deviation of the Gaussian edge
        noise when the --noisyEdges option is used (default 0.333)
g = float, msgsteiner reinforcement parameter that affects the convergence of the
    solution and runtime, with larger values leading to faster convergence
    but suboptimal results (default 0.001)
r = float, msgsteiner parameter that adds random noise to edges,
    which is rarely needed because the Forest --noisyEdges option
    is recommended instead (default 0)
processes = int, number of processes to spawn when doing randomization runs
            (default to number of processors on your computer)
threads = int, number of threads to use during msgsteiner optimization
          (default 1)
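A configuration file in the "ParameterName = ParameterValue" form can be generated from a dictionary, which makes it easy to confirm that w, b, and D are present. This is a sketch with a hypothetical helper name; the numeric values in the example are placeholders, not recommended settings.

```python
REQUIRED = ("w", "b", "D")
OPTIONAL = ("mu", "garnetBeta", "noise", "g", "r", "processes", "threads")


def format_forest_conf(params):
    """Return configuration text, one 'name = value' line per parameter."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: %s" % ", ".join(missing))
    unknown = [name for name in params if name not in REQUIRED + OPTIONAL]
    if unknown:
        raise ValueError("unknown parameters: %s" % ", ".join(sorted(unknown)))
    ordered = [name for name in REQUIRED + OPTIONAL if name in params]
    return "".join("%s = %s\n" % (name, params[name]) for name in ordered)


# Placeholder values only; choose real values for your own data.
print(format_forest_conf({"w": 2, "b": 1, "D": 10, "mu": 0.0}))
```

Writing the returned text to a file gives something that can be passed to forest.py with -c or --conf=.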

For more details about the parameters, see our publication.

Optional inputs

The rest of the command line options are optional.

If you have run the garnet module to create scores for transcription factors, you can include that output file with the --garnet option and use garnetBeta in the configuration file to scale the garnet scores.

The --dummyMode option will change which nodes in the interactome are connected to the dummy node. We provide an example of this using a549/Tgfb_interactors.txt. For an explanation of the dummy node, see the publication.

The --musquared option will apply negative prizes to nodes based on their squared degree, as opposed to linear degree. This is helpful if the default mu behavior is not strict enough to eliminate irrelevant hub nodes from your network.

If the file msgsteiner is not in the same directory as forest.py, the path needs to be specified using the --msgpath option, e.g., '--msgpath /home/msgsteiner-1.3/msgsteiner'.

If you would like the output files to be stored in a directory other than the one you are running the code from, you can specify this directory with the --outpath option. The names of the output files will all start with the word result unless you specify another word or phrase, such as an identifying label for this experiment or run, with the --outlabel option. The --cyto30 and --cyto28 flags can be used to specify which version of Cytoscape you would like the output files to be compatible with.

We include three options, --noisyEdges, --shuffledPrizes, and --randomTerminals, to determine how robust your results are by comparing them to results with slightly altered input values. To use these options, supply a number greater than 0 for any of these parameters. If the number you give is more than 1, it will alter values and run the program that number of times and merge the results together. The program will add Gaussian noise to the edge values you gave in the -e option, or shuffle the prizes around all the network proteins in the -p option, or assign the prizes to network proteins with similar degrees as your original terminals, according to which option you use. In --noisyEdges, Gaussian noise with mean 0 and standard deviation specified by the parameter noise in the configuration file (default 0.333) will be added to the edge scores. The results from these runs will be stored in separate files from the results of the run with the original prize or edge values, and both will be written by the program to the same directory.
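The edge perturbation described for --noisyEdges amounts to adding a draw from a zero-mean Gaussian to each edge weight. The sketch below only illustrates that idea: it is not Forest's implementation, the function name is hypothetical, and it makes no attempt to handle weights pushed outside the 0 to 1 range, since the text above does not say how Forest treats them.

```python
import random


def add_edge_noise(weights, noise_sd=0.333, seed=None):
    """Return a copy of the weights with Gaussian noise (mean 0) added."""
    rng = random.Random(seed)  # a fixed seed makes the perturbation repeatable
    return [weight + rng.gauss(0.0, noise_sd) for weight in weights]


# Hypothetical edge weights; the same seed yields the same noisy values.
original = [0.2, 0.5, 0.9]
print(add_edge_noise(original, seed=7))
```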

The --knockout option can be used if you would like to simulate a knockout experiment by removing a node from your interactome. Specify your knockout proteins in a list, e.g. ['TP53'] or ['TP53', 'EGFR'].

The -k and --cv options can be used if you would like to run k-fold cross validation. This will partition the proteins with prizes into k equal subsamples. It will run msgsteiner k times, leaving one subsample of prizes out each time. The --cv-reps option can be used if you would like to run k-fold cross validation multiple times, each time with a different random partitioning of terminals. If you do not supply --cv-reps but do provide a k, cross validation will be run once. Each time it is run, a file called <outputlabel>_cvResults_<rep>.txt will be created. For each of the k iterations, it will display the number of terminals held out of the prizes dictionary, the number of those that were recovered in the optimal network as Steiner nodes, and the total number of Steiner nodes in the optimal network.
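The partitioning behind k-fold cross validation can be sketched in a few lines. This is an illustration of the general technique, not Forest's code: the function names are hypothetical, protein names are assumed to be unique, and folds differ in size by at most one when the number of proteins is not divisible by k.

```python
import random


def k_fold_partitions(proteins, k, seed=None):
    """Shuffle the prize proteins and deal them into k near-equal folds."""
    if not 1 < k <= len(proteins):
        raise ValueError("k must be between 2 and the number of prize proteins")
    shuffled = sorted(proteins)  # sort first so a given seed is reproducible
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def cross_validation_runs(proteins, k, seed=None):
    """Yield (kept, held_out) protein lists, one pair per fold."""
    folds = k_fold_partitions(proteins, k, seed)
    for held_out in folds:
        hidden = set(held_out)
        kept = [p for fold in folds for p in fold if p not in hidden]
        yield kept, held_out


# Hypothetical prize proteins: each run hides one fold and keeps the rest.
for kept, held_out in cross_validation_runs(["P1", "P2", "P3", "P4"], 2, seed=0):
    print(len(kept), len(held_out))
```

Counting how many held-out proteins reappear in the resulting network gives the recovery figure described above.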

The -s option will supply a seed to the pseudo-random number generators used in noisyEdges, shuffledPrizes, randomTerminals, and the optimization in msgsteiner itself. If you want to reproduce exact results, you should supply the same seed every time. If you do not supply your own seed, system time is used as the seed.
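When Forest is launched from another Python script, the command line can be assembled from the options described above. The helper below is a hypothetical sketch: it only builds the argument list, uses the --name=value spelling from the usage text, and leaves the interpreter and the location of forest.py to the caller, since these depend on the installation.

```python
def build_forest_command(prize_file, edge_file, conf_file=None, garnet_file=None,
                         msgpath=None, outpath=None, outlabel=None, seed=None,
                         script="forest.py"):
    """Return the forest.py argument list for the given inputs."""
    command = [script, "--prize=" + prize_file, "--edge=" + edge_file]
    optional = [
        ("--conf=", conf_file),
        ("--garnet=", garnet_file),
        ("--msgpath=", msgpath),
        ("--outpath=", outpath),
        ("--outlabel=", outlabel),
        ("--seed=", seed),
    ]
    for flag, value in optional:
        if value is not None:  # omitted options fall back to Forest's defaults
            command.append(flag + str(value))
    return command


command = build_forest_command(
    "example/a549/Tgfb_phos.txt",
    "data/iref_mitab_miscore_2013_08_12_interactome.txt",
    conf_file="a549/tgfb_forest.cfg",
    msgpath="/home/msgsteiner-1.3/msgsteiner",
    seed=2,
)
print(" ".join(command))
```

The resulting list could be handed to subprocess.call with the Python interpreter prepended.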

Running forest

Once you submit your command to the command line the program will run. It will display messages as it completes, letting you know where in the process you are. If there is a warning or an error it will be displayed on the command line. If the run completes successfully, several files will be created. These files can be imported into Cytoscape v.3.0 to view the results of the run. These files will be named first with the outputlabel that you provided (or result by default), and then with a phrase identifying which file type it is.

Forest output

info.txt contains information about the algorithm run, including any error messages if there were any during the run.
optimalForest.sif contains the optimal network output of the message-passing algorithm (without the dummy node). It is a Simple Interaction Format file. To see the network, open Cytoscape, and click on File > Import > Network > File..., and then select this file to open. Click OK.
augmentedForest.sif is the same thing, only it includes all the edges in the interactome that exist between nodes in the optimal Forest, even those edges not chosen by the algorithm. Betweenness centrality for all nodes was calculated with this network.
dummyForest.sif is the same as optimalForest.sif, only it includes the dummy node and all edges connecting to it.
edgeattributes.tsv is a tab-separated value file containing information for each edge in the network, such as the weight in the interactome, and the fraction of optimal networks this edge was contained in. To import this information into Cytoscape, first import the network .sif file you would like to view, and then click on File > Import > Table > File..., and select this file. Specify that this file contains edge attributes, rather than node attributes, and that the first row of the file should be interpreted as column labels. Click OK.
nodeattributes.tsv is a tab-separated value file containing information for each node in the network, such as the prize you assigned to it and betweenness centrality in the augmented network. To import this information into Cytoscape, first import the network .sif file you would like to view, and then click on File > Import > Table > File..., and select this file. Specify that this file contains node attributes, rather than edge attributes, and that the first row of the file should be interpreted as column labels. Click OK.

When the network and the attributes are imported into Cytoscape, you can alter the appearance of the network as you usually would using VizMapper.
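The .sif files can also be inspected outside Cytoscape. In the Simple Interaction Format each line names a source node, an interaction type, and one or more target nodes. The reader below is a minimal sketch under the assumption that node names contain no whitespace; the function name and the "pp" interaction label in the example are hypothetical and may differ from what Forest writes.

```python
def read_sif_edges(lines):
    """Return (source, interaction, target) tuples from SIF-formatted lines."""
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank lines and isolated nodes carry no edge
        source, interaction = fields[0], fields[1]
        for target in fields[2:]:  # one line may list several targets
            edges.append((source, interaction, target))
    return edges


# Hypothetical file content, for illustration only.
print(read_sif_edges(["NodeA pp NodeB", "NodeB pp NodeC NodeD", ""]))
```

To read a real file, pass the open file object, for example read_sif_edges(open(path)).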

Testing

See the tests directory for instructions on testing Omics Integrator.

Third Party Code

See the 'LICENSE-3RD-PARTY' file for license information for: python-avl-tree by Pavel Grafov
