User's Guide for the Online Tool ToFiGAPS

ToFiGAPS is an online Tool for Finding Genomic And Protein Sequences of any genomic or protein regions. It is a web interface to the R package geno2proteo that we created. By using this online tool, a user is able to perform the following four tasks:

Given a list of genomic regions specified by their coordinates on genome, find the DNA sequences of these regions.
Given a list of genomic regions, find the protein sequences which are coded by these regions. Note that it also retrieves the exact DNA sequences which code those protein sequences.
Given a list of protein regions specified by their coordinates on the protein, find the genomic coordinates in the corresponding genome which code those protein regions.
Given a list of protein regions specified by their coordinates on the protein, find the amino acid sequences of these protein regions.

To perform a task, a user simply goes to the tool's main web page and follows these four steps:

Choose one of the four tasks that the user wants to perform.
Select the genome that the user wants to use.
Provide a list of genomic regions or protein regions as input, depending on the task chosen.
Finally, click the Submit button.

The web server normally takes from under a minute to a few minutes to finish the task. The results then appear in the result box at the bottom of the main web page, from which the result file can also be downloaded.

In the following we explain the input and output data and their formats that are accepted by this tool. For a detailed description of the methods behind the tool, please see the documentation of the R package geno2proteo.

First note that we follow the standard conventions for representing genomic regions, DNA sequences, and protein amino acid sequences. In particular, for a genomic region, the start coordinate is always less than or equal to the end coordinate, regardless of the strand. A DNA sequence is always given from the 5' end to the 3' end of the strand specified. A protein region's sequence is given from the N-terminus to the C-terminus of the protein.
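As a small illustration of these conventions, the sketch below computes the length of a region whose coordinates are inclusive at both ends (which is consistent with the examples in this guide) and reads a minus-strand sequence 5' to 3' by reverse-complementing the plus-strand sequence. The function names are hypothetical and are not part of ToFiGAPS or geno2proteo.

```python
# Hypothetical helpers that illustrate the conventions; not part of ToFiGAPS.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def region_length(start, end):
    """Length of a region whose start and end are both inclusive."""
    if start > end:
        raise ValueError("start must be <= end, regardless of strand")
    return end - start + 1


def five_to_three(plus_strand_seq, strand):
    """Return the sequence read 5' to 3' on the requested strand.

    plus_strand_seq is the sequence of the region as read on the + strand.
    """
    if strand == "+":
        return plus_strand_seq
    if strand == "-":
        # Complement each base, then reverse the order.
        return plus_strand_seq.translate(COMPLEMENT)[::-1]
    raise ValueError("strand must be '+' or '-'")
```

For example, a region on the minus strand whose plus-strand sequence is ATTTT would be reported as AAAAT.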

Input

The input is either some genomic regions or some protein regions. The first two tasks need the input of a list of genomic regions, and the other two tasks need the input of a list of protein regions. You can put the input data into a text file and upload the file. Alternatively, you can copy and paste a list of regions into the input text box, one region per line.

A genomic region is represented by four items: chromosome, genomic coordinate of start, genomic coordinate of end, and strand, separated by either white space or tab. Note that if the strand information is not given in a list of regions, the tool assumes the '+' strand for all the regions in the list. The following is a list of genomic regions in the format that the online tool requires:

chr1 23394882 23395709 +
chr16 9856551 9863407 -
chrX 207170 208356 +

The UCSC genome browser format, which specifies the chromosome and the start and end coordinates, is also accepted for representing genomic regions:

chr1:23,394,882-23,395,709
chr16:3,166,431-3,166,529
chrX:3,169,751-3,169,819
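The two input formats for genomic regions can be read with a few lines of code. The following is a minimal sketch of such a parser, not the tool's own implementation; the function name and the regular expression are hypothetical. Whether a strand may follow a UCSC-style region is not stated in this guide, so accepting one here is an assumption.

```python
import re

# Matches regions such as "chr1:23,394,882-23,395,709" (commas optional).
UCSC_PATTERN = re.compile(r"^([^:\s]+):([\d,]+)-([\d,]+)$")


def parse_genomic_region(line, default_strand="+"):
    """Parse one input line into (chromosome, start, end, strand)."""
    fields = line.split()  # splits on spaces and tabs alike
    if not fields:
        raise ValueError("empty line")
    match = UCSC_PATTERN.match(fields[0])
    if match:
        chrom = match.group(1)
        start = int(match.group(2).replace(",", ""))
        end = int(match.group(3).replace(",", ""))
        rest = fields[1:]  # assumption: an optional strand may follow
    else:
        if len(fields) < 3:
            raise ValueError("expected: chromosome start end [strand]")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        rest = fields[3:]
    # A missing strand defaults to '+', as described in this guide.
    strand = rest[0] if rest else default_strand
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if start > end:
        raise ValueError("start must be <= end")
    return chrom, start, end, strand
```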

A protein region is represented by three items: protein Ensembl ID, and start and end coordinates along the protein, separated by either white space or tab. The following is a list of protein regions in the format that the online tool requires:

ENSP00000371627 123 155
ENSP00000380077 279 301
ENSP00000403892 41 127

A more detailed description of the representations of genomic regions and protein regions that this online tool accepts is given below.

Within a specific genome, a genomic region is represented by the name and the strand of the chromosome where the region is located, and the region’s start and end coordinates along the chromosome. Note that the start coordinate is always less than or equal to the end coordinate. The strand of the chromosome is denoted as “+” for the forward strand and “-” for the reverse strand. One example of a genomic region in the human genome is “chr22 20127078 20127380 +”, which can also be represented as “22 20127078 20127380 +”. Note that the chromosome names in the input data can be either in the Ensembl style, e.g. 1, 2, 3, ..., X, Y and MT, or in another popular style, namely chr1, chr2, chr3, ..., chrX, chrY and chrM. The two styles cannot be mixed in one input.
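Because the two chromosome naming styles cannot be mixed in one input, it can be useful to check a list before submitting it. The sketch below is one hypothetical way to do so; the function names and the style labels "chr-prefixed" and "ensembl" are introduced here for illustration only.

```python
# Hypothetical pre-submission check; not part of ToFiGAPS.
def chromosome_style(name):
    """Classify a chromosome name as 'chr-prefixed' or 'ensembl'."""
    return "chr-prefixed" if name.startswith("chr") else "ensembl"


def has_mixed_styles(names):
    """True if the chromosome names use both naming styles."""
    return len({chromosome_style(name) for name in names}) > 1


def to_ensembl_style(name):
    """Convert e.g. 'chr22' to '22' and 'chrM' to 'MT'."""
    if name == "chrM":
        return "MT"
    return name[3:] if name.startswith("chr") else name
```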

A protein region is a stretch of consecutive amino acids in a protein. It is represented by the Ensembl ID of either the protein or the corresponding transcript, and the start and end coordinates along the protein. The representation therefore has two formats, and you can choose whichever is more convenient. For example, the region from the 91st amino acid to the 163rd amino acid in the protein corresponding to the transcript ZDHHC8-004 is represented as either “ENSP00000412807 91 163” or “ENST00000436518 91 163”, where ENSP00000412807 and ENST00000436518 are the Ensembl IDs of the protein and the transcript of ZDHHC8-004, respectively, in the Ensembl Homo sapiens GRCh37 V74 genome. Note that you can use either format, but you have to use the same format throughout one input file.
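A protein region line can be parsed and classified in a similar way. The sketch below is hypothetical and not the tool's own code. It assumes that Ensembl stable IDs have the form "ENS", an optional species prefix, a feature letter (P for protein, T for transcript) and a run of digits, which covers the human IDs used in this guide.

```python
import re

# Assumed ID shape: ENS + optional species prefix + P or T + digits.
ENSEMBL_ID = re.compile(r"^ENS[A-Z]*([PT])\d+$")


def parse_protein_region(line):
    """Parse one line into (ensembl_id, start, end, id_kind)."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError("expected: ensembl_id start end")
    ensembl_id, start, end = fields[0], int(fields[1]), int(fields[2])
    match = ENSEMBL_ID.match(ensembl_id)
    if not match:
        raise ValueError("not an Ensembl protein or transcript ID")
    id_kind = "protein" if match.group(1) == "P" else "transcript"
    # Coordinates count amino acids from 1 and are inclusive.
    if not 1 <= start <= end:
        raise ValueError("coordinates must satisfy 1 <= start <= end")
    return ensembl_id, start, end, id_kind


def uses_one_id_kind(lines):
    """True if all lines use protein IDs or all use transcript IDs."""
    return len({parse_protein_region(line)[3] for line in lines}) <= 1
```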

Output

The output of the analysis depends on the type of task chosen. The output for each of the four tasks is described below. Note that all DNA sequences in the output are given from 5' to 3' along the strand of the region, whether + or -.

Find the DNA sequences of any genomic regions:
The results contain the list of the original genomic regions as in the input, followed by an added column for the DNA sequence of the corresponding genomic region. An example of the output of this task for three genomic regions is:

chromosome	start	end	strand	dnaSeq
chr1	23394982	23395000	+	GTTATGTTTAGTTTTATAA
chr16	9856551	9856555	-	AAAAT
chrX	207170	207180	+	CCAGCACATAC
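A downloaded result can be checked quickly by confirming that each DNA sequence has as many letters as the region has positions. The sketch below assumes that the result file is tab-separated with the header shown above; the function name is hypothetical.

```python
import csv
import io


def find_length_mismatches(tsv_text):
    """Return (chromosome, start, end) for rows whose dnaSeq length
    differs from end - start + 1 (coordinates are inclusive)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    mismatches = []
    for row in reader:
        start, end = int(row["start"]), int(row["end"])
        if len(row["dnaSeq"]) != end - start + 1:
            mismatches.append((row["chromosome"], start, end))
    return mismatches
```

Applied to the three example rows above, this check finds no mismatches.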

Find the protein sequences of any genomic regions:
The results contain the list of the original genomic regions, followed by five added columns:

Column “transId” lists the Ensembl IDs of the transcripts whose coding regions overlap with the specified locus and whose overlapping coding regions are exactly the same.
Column “dnaSeq” contains the DNA sequence in the overlapping coding regions.
Column “dnaBefore” contains the DNA letters which are in the same codon as the first letter in the DNA sequence in the column “dnaSeq”.
Column “dnaAfter” contains the DNA letters which are in the same codon as the last letter in the DNA sequence in the previous column “dnaSeq”.
Column “pepSeq” contains the amino acid sequence translated from the DNA sequences in the three preceding columns, “dnaBefore”, “dnaSeq” and “dnaAfter”.

Note that if, according to the chosen gene annotations, a genomic region codes for more than one distinct protein sequence, those protein sequences are displayed on separate lines, each together with the same genomic region. The output may therefore have more lines than the original input list.
The following is the output of this task for three genomic regions. Note that there is no result for the third genomic region (shown as five NAs), because this region does not overlap with any protein-coding region in the genome.

chromosome	start	end	strand	transId	dnaSeq	dnaBefore	dnaAfter	pepSeq
chr1	23395000	23395050	+	ENST00000356634;ENST00000400181;ENST00000542151	GTTCCTAAAGAGAAAGATG		AA	VPKEKDE
chr16	9857914	9857924	-	ENST00000330684;ENST00000396573;ENST00000396575;ENST00000404927;ENST00000535259;ENST00000562109	CAAGGGGGACT	CG	CC	RKGDS
chrX	207170	207185	+	NA	NA	NA	NA	NA
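The relationship between the columns “dnaBefore”, “dnaSeq”, “dnaAfter” and “pepSeq” can be reproduced by joining the three DNA columns and translating the result codon by codon. The sketch below uses the standard genetic code, which is an assumption (mitochondrial genes, for instance, use a different code); the names are hypothetical and this is not the code used by geno2proteo.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_columns(dna_before, dna_seq, dna_after):
    """Translate dnaBefore + dnaSeq + dnaAfter into amino acids."""
    full = (dna_before + dna_seq + dna_after).upper()
    if len(full) % 3 != 0:
        raise ValueError("combined DNA length must be a multiple of 3")
    return "".join(CODON_TABLE[full[i:i + 3]] for i in range(0, len(full), 3))
```

For the first example row, translate_columns("", "GTTCCTAAAGAGAAAGATG", "AA") gives VPKEKDE, and for the second row, translate_columns("CG", "CAAGGGGGACT", "CC") gives RKGDS, matching the “pepSeq” column.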

Find the genomic coordinates of any protein regions:
The results contain the list of the original protein regions and, before them, six added columns for the genomic positions of the protein regions:

The 1st, 2nd, 3rd and 4th columns show the chromosome name, start and end positions, and strand, which together specify the genomic region that is translated into the protein region given in the input.
The 5th and 6th columns show the first and last exons for the given protein region, counted in the context of the coding exons of the transcript.

The following are some examples of the output of this task for three protein regions:

chromosome	start	end	strand	start_exon	end_exon	Id	start	end
chr16	3166431	3166529	+	Exon_4	Exon_4	ENSP00000371627	123	155
chr16	3274174	3274242	-	Exon_4	Exon_4	ENSP00000380077	279	301
chr16	3188540	3190823	+	Exon_1	Exon_3	ENSP00000403892	41	127
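When a protein region lies within a single exon, its genomic region should be exactly three times as long as the protein region, since each amino acid is coded by three bases. The sketch below applies this check to one output row. It is a hypothetical helper that reads the columns by position, because the header contains “start” and “end” twice.

```python
def single_exon_length_ok(row):
    """Check a row of nine column values from the output above.

    Returns True or False for regions within one exon, and None when the
    region spans several exons (the genomic span then includes introns).
    """
    (chrom, g_start, g_end, strand,
     first_exon, last_exon, ensembl_id, p_start, p_end) = row
    if first_exon != last_exon:
        return None
    genomic_length = int(g_end) - int(g_start) + 1
    protein_length = int(p_end) - int(p_start) + 1
    return genomic_length == 3 * protein_length
```

A row can be obtained from a line of the result file with line.split("\t"). The first two example rows pass the check, and the third is skipped because it runs from Exon_1 to Exon_3.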

Find the amino acid sequences of any protein regions:
The results contain the list of the original protein regions, followed by an added column for the amino acid sequences of the protein regions. The following are some examples of the output of this task for three protein regions:

Id	start	end	pepSeq
ENSP00000371627	123	155	PVTFEDVALYLSREEWGRLDHTQQNFYRDVLQK
ENSP00000380077	279	301	YDCNHCGKSFNHKTNLNKHERIH
ENSP00000403892	41	55	AGPVALGDIPFYFSR