ToFiGAPS is an online Tool for Finding Genomic And Protein Sequences of any genomic or protein regions. It is a web interface to the R package geno2proteo that we created. By using this online tool, a user is able to perform the following four tasks:
To perform a task, a user simply goes to the tool's main web page and follows the four steps:
The web server normally finishes a task in anywhere from under a minute to a few minutes. The user can then view the results in the result box at the bottom of the main web page and download the result file from the same box.
In the following we explain the input and output data and the formats accepted by this tool. For a detailed description of the methods behind the tool, please see the documentation of the R package geno2proteo.
First, note that we follow the usual conventions for representing genomic regions, DNA sequences, and protein amino acid sequences. In particular, for a genomic region, the start genomic coordinate is always less than or equal to the end coordinate, regardless of the strand. A DNA sequence is always written from the 5' end to the 3' end of the strand indicated by the strand information. A protein region's sequence is written from the N-terminus to the C-terminus of the protein.
The input is either some genomic regions or some protein regions. The first two tasks need the input of a list of genomic regions, and the other two tasks need the input of a list of protein regions. You can put the input data into a text file and upload the file. Alternatively, you can copy and paste a list of regions into the input text box, one region per line.
A genomic region is represented by four items: the chromosome, the start genomic coordinate, the end genomic coordinate, and the strand, separated by either white space or a tab. Note that if the strand information is not given in a list of regions, the tool assumes the '+' strand for all the regions in the list. The following is a genomic region in the format that the online tool requires:
chr1 23394882 23395709 +
The UCSC genome browser format for specifying the chromosome and the start and end genomic coordinates is also accepted for representing a genomic region:
chr1:23,394,882-23,395,709
A protein region is represented by three items: the Ensembl ID of the protein and the start and end coordinates along the protein, separated by either white space or a tab. The following is a protein region in the format that the online tool requires:
ENSP00000371627 123 155
A more detailed description of the representations of genomic regions and protein regions that this online tool accepts is given in the following.
Within a specific genome, a genomic region is represented by the name and the strand of the chromosome where the region is located, and the region’s start and end coordinates along the chromosome. Note that the start coordinate is always less than or equal to the end coordinate. The strand of the chromosome is denoted as “+” for the forward strand and “-” for the reverse strand. One example of a genomic region in the human genome is “chr22 20127078 20127380 +”, which can also be represented as “22 20127078 20127380 +”. Note that the chromosome name in the input data can be either in the Ensembl style, e.g. 1, 2, 3, ..., X, Y and MT, or in another popular style, namely chr1, chr2, chr3, ..., chrX, chrY and chrM. The two styles cannot be mixed in one input.
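The input rules for genomic regions can be sketched as a small parser. This is only an illustration of the format described above, not the tool's own code: the names `GenomicRegion`, `parse_genomic_region` and `parse_genomic_regions` are hypothetical, and defaulting to '+' for a UCSC-style entry without a strand is an assumption.

```python
import re
from typing import List, NamedTuple


class GenomicRegion(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str


# UCSC-style position such as "chr1:23,394,882-23,395,709"
_UCSC = re.compile(r"^(\S+):([\d,]+)-([\d,]+)$")


def parse_genomic_region(line: str) -> GenomicRegion:
    """Parse one region given as 'chrom start end [strand]' or in UCSC style."""
    fields = line.split()  # splits on spaces and tabs alike
    if not fields:
        raise ValueError("empty line")
    match = _UCSC.match(fields[0])
    if match:
        chrom, start, end = match.groups()
        rest = fields[1:]
    elif len(fields) >= 3:
        chrom, start, end = fields[:3]
        rest = fields[3:]
    else:
        raise ValueError(f"cannot parse region: {line!r}")
    if len(rest) > 1:
        raise ValueError(f"too many fields: {line!r}")
    strand = rest[0] if rest else "+"  # '+' is assumed when no strand is given
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-': {line!r}")
    start_pos = int(start.replace(",", ""))
    end_pos = int(end.replace(",", ""))
    if start_pos > end_pos:
        raise ValueError(f"start must not exceed end: {line!r}")
    return GenomicRegion(chrom, start_pos, end_pos, strand)


def parse_genomic_regions(text: str) -> List[GenomicRegion]:
    """Parse one region per line and reject mixed chromosome naming styles."""
    regions = [parse_genomic_region(ln) for ln in text.splitlines() if ln.strip()]
    styles = {region.chrom.startswith("chr") for region in regions}
    if len(styles) > 1:
        raise ValueError("chromosome names mix the 'chr1' and '1' styles")
    return regions
```

With this sketch, `chr1 23394882 23395709 +` and `chr1:23,394,882-23,395,709` parse to the same region, and an input that mixes `chr1` with `22` is rejected.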
A protein region is a section of the protein consisting of consecutive amino acids. It is represented by the Ensembl ID of either the protein or the corresponding transcript, and the start and end coordinates along the protein. The representation therefore has two formats, from which you can choose at your convenience. For example, a region from the 91st amino acid to the 163rd amino acid in the protein corresponding to the transcript ZDHHC8-004 is represented as either “ENSP00000412807 91 163” or “ENST00000436518 91 163”, where ENSP00000412807 and ENST00000436518 are the Ensembl IDs of the protein and the transcript of ZDHHC8-004, respectively, in the Ensembl Homo sapiens GRCh37 V74 genome. Note that you can use either format, but you have to use the same format throughout one input file.
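The protein region format can be checked in the same spirit. The sketch below is hypothetical (the names `ProteinRegion` and `parse_protein_regions` are not part of the tool). It assumes that Ensembl IDs consist of `ENS`, an optional species prefix, a feature letter (`P` for protein, `T` for transcript) and digits, and that protein coordinates are 1-based, as the "91st amino acid" example suggests.

```python
import re
from typing import List, NamedTuple


class ProteinRegion(NamedTuple):
    ensembl_id: str
    start: int
    end: int


# Assumed ID shape: "ENS" + optional species letters + P or T + digits
_ENSEMBL_ID = re.compile(r"^ENS[A-Z]*([PT])\d+$")


def parse_protein_regions(text: str) -> List[ProteinRegion]:
    """Parse 'ID start end' lines and require a single ID type per input."""
    regions = []
    id_types = set()
    for line in text.splitlines():
        fields = line.split()  # spaces or tabs
        if not fields:
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields: {line!r}")
        match = _ENSEMBL_ID.match(fields[0])
        if match is None:
            raise ValueError(f"not a protein or transcript ID: {fields[0]!r}")
        start, end = int(fields[1]), int(fields[2])
        if not 1 <= start <= end:
            raise ValueError(f"need 1 <= start <= end: {line!r}")
        id_types.add(match.group(1))
        regions.append(ProteinRegion(fields[0], start, end))
    if len(id_types) > 1:
        raise ValueError("protein and transcript IDs are mixed in one input")
    return regions
```

Under these assumptions, an input containing only `ENSP00000412807 91 163` or only `ENST00000436518 91 163` is accepted, while an input containing both lines is rejected.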
The output of the analysis performed by the tool depends on the type of task chosen. Below are the descriptions of the output for each of the four tasks. Please note that all DNA sequences are always reported from 5' to 3' on the strand of the region, whether that is the + or the - strand.
chromosome | start | end | strand | dnaSeq |
---|---|---|---|---|
chr1 | 23394982 | 23395000 | + | GTTATGTTTAGTTTTATAA |
chr16 | 9856551 | 9856555 | - | AAAAT |
chrX | 207170 | 207180 | + | CCAGCACATAC |
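The 5' to 3' convention means that the sequence of a region on the - strand is the reverse complement of the forward-strand bases between the same coordinates. As a minimal sketch (the function names are hypothetical and this is not the tool's own code): if the chr16 row above follows the convention, the forward strand at 9856551 to 9856555 would read ATTTT, which is reported as AAAAT on the - strand.

```python
# Complement table for DNA bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_sequence(forward_seq: str, strand: str) -> str:
    """Return the 5'-to-3' sequence of a region on the requested strand.

    forward_seq holds the forward-strand bases read from the start
    coordinate to the end coordinate of the region.
    """
    if strand == "+":
        return forward_seq
    if strand == "-":
        return reverse_complement(forward_seq)
    raise ValueError("strand must be '+' or '-'")


print(strand_sequence("ATTTT", "-"))  # AAAAT
```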
chromosome | start | end | strand | transId | dnaSeq | dnaBefore | dnaAfter | pepSeq |
---|---|---|---|---|---|---|---|---|
chr1 | 23395000 | 23395050 | + | ENST00000356634;ENST00000400181;ENST00000542151 | GTTCCTAAAGAGAAAGATG | | AA | VPKEKDE |
chr16 | 9857914 | 9857924 | - | ENST00000330684;ENST00000396573;ENST00000396575;ENST00000404927;ENST00000535259;ENST00000562109 | CAAGGGGGACT | CG | CC | RKGDS |
chrX | 207170 | 207185 | + | NA | NA | NA | NA | NA |
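In the chr16 row above, the peptide appears to be the translation of dnaBefore, dnaSeq and dnaAfter joined together, so the two flanking columns seem to complete the partial codons at the region's edges. The sketch below checks this reading with the standard genetic code. It is an inference from the example row, not a description of how geno2proteo is implemented, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
_BASES = "TCAG"
_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA sequence whose length is a multiple of three."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def peptide_from_row(dna_before: str, dna_seq: str, dna_after: str) -> str:
    """Join the flanking bases to the region's DNA and translate the result."""
    return translate(dna_before + dna_seq + dna_after)


# chr16 row: CG + CAAGGGGGACT + CC -> CGC AAG GGG GAC TCC
print(peptide_from_row("CG", "CAAGGGGGACT", "CC"))  # RKGDS
```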
chromosome | start | end | strand | start_exon | end_exon | Id | start | end |
---|---|---|---|---|---|---|---|---|
chr16 | 3166431 | 3166529 | + | Exon_4 | Exon_4 | ENSP00000371627 | 123 | 155 |
chr16 | 3274174 | 3274242 | - | Exon_4 | Exon_4 | ENSP00000380077 | 279 | 301 |
chr16 | 3188540 | 3190823 | + | Exon_1 | Exon_3 | ENSP00000403892 | 41 | 127 |
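The rows above can be sanity-checked with simple coordinate arithmetic, assuming both coordinate systems are 1-based and inclusive. A protein region of n amino acids is encoded by 3n bases, so a region that lies within one exon should span exactly 3n genomic positions, while a region that crosses exons spans more, presumably because the genomic interval then includes the introns in between. The helper names below are hypothetical.

```python
def inclusive_length(start: int, end: int) -> int:
    """Number of positions in a 1-based, inclusive interval."""
    if start > end:
        raise ValueError("start must not exceed end")
    return end - start + 1


def coding_bases(protein_start: int, protein_end: int) -> int:
    """Number of coding bases needed for a protein region (3 per residue)."""
    return 3 * inclusive_length(protein_start, protein_end)


# (genomic start, genomic end, protein start, protein end), from the table above
rows = [
    (3166431, 3166529, 123, 155),  # Exon_4 to Exon_4
    (3274174, 3274242, 279, 301),  # Exon_4 to Exon_4
    (3188540, 3190823, 41, 127),   # Exon_1 to Exon_3
]
for g_start, g_end, p_start, p_end in rows:
    span = inclusive_length(g_start, g_end)
    needed = coding_bases(p_start, p_end)
    print(span, needed, span == needed)
```

For the two single-exon rows the genomic span equals the number of coding bases (99 and 69), whereas the row running from Exon_1 to Exon_3 spans 2284 positions for 261 coding bases.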
Id | start | end | pepSeq |
---|---|---|---|
ENSP00000371627 | 123 | 155 | PVTFEDVALYLSREEWGRLDHTQQNFYRDVLQK |
ENSP00000380077 | 279 | 301 | YDCNHCGKSFNHKTNLNKHERIH |
ENSP00000403892 | 41 | 55 | AGPVALGDIPFYFSR |