The User's Guide of the online tool ToFiGAPS

ToFiGAPS is an online Tool for Finding Genomic And Protein Sequences of any genomic or protein regions. It is a web interface to the R package geno2proteo that we created. By using this online tool, a user is able to perform the following four tasks:

  1. Given a list of genomic regions specified by their coordinates on genome, find the DNA sequences of these regions.
  2. Given a list of genomic regions, find the protein sequences which are coded by these regions. Note that it also retrieves the exact DNA sequences which code those protein sequences.
  3. Given a list of protein regions specified by their coordinates on the protein, find the genomic coordinates in the corresponding genome which code those protein regions.
  4. Given a list of protein regions specified by their coordinates on the protein, find the amino acid sequences of these protein regions.

To perform a task, the user simply goes to the tool's main web page and follows these four steps:

  1. Choose one of the four tasks that the user wants to perform.
  2. Select the genome that the user wants to use.
  3. Provide a list of genomic regions or protein regions as input, depending on the task that the user has chosen.
  4. Finally click the Submit button.

The web server normally takes from less than one minute to a few minutes to finish the task. The user can then view the results in the result box at the bottom of the main web page and download the result file from the same box.

In the following we explain the input and output data and their formats that are accepted by this tool. For a detailed description of the methods behind the tool, please see the documentation of the R package geno2proteo.

First, note that we follow the usual conventions for representing genomic regions, DNA sequences and the amino acid sequences of proteins. In particular, for a genomic region, the start coordinate is always less than or equal to the end coordinate, regardless of the strand. A DNA sequence is always given from the 5' end to the 3' end of the strand specified for the region. A protein region's sequence is given from the N-terminus to the C-terminus of the protein.
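To illustrate the strand convention: the 5' to 3' sequence of a region on the '-' strand is the reverse complement of the forward-strand sequence over the same coordinates. The following is a minimal Python sketch of that relationship. The function name reverse_complement and the input sequence are ours, chosen for illustration only; they are not part of ToFiGAPS or geno2proteo.

```python
# Illustrative helper only; not part of ToFiGAPS or geno2proteo.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the 5'->3' sequence of the opposite strand."""
    # Complement each base, then reverse the order of the bases.
    return seq.translate(_COMPLEMENT)[::-1]


forward = "ATTTT"  # made-up forward-strand bases, read 5'->3'
print(reverse_complement(forward))  # AAAAT
```

Applying the function twice returns the original sequence, which is a quick sanity check when comparing results for the two strands.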

Input

The input is either some genomic regions or some protein regions. The first two tasks need the input of a list of genomic regions, and the other two tasks need the input of a list of protein regions. You can put the input data into a text file and upload the file. Alternatively, you can copy and paste a list of regions into the input text box, one region per line.

A genomic region is represented by four items: chromosome, start coordinate, end coordinate and strand, separated by spaces or tabs. Note that if the strand is not given in a list of regions, the tool assumes the '+' strand for all the regions in the list. The following is a list of genomic regions in the format that the online tool requires:

chr1 23394882 23395709 +
chr16 9856551 9863407 -
chrX 207170 208356 +

The UCSC genome browser format, which specifies the chromosome and the start and end coordinates, is also accepted for representing genomic regions:

chr1:23,394,882-23,395,709
chr16:3,166,431-3,166,529
chrX:3,169,751-3,169,819
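The two input formats for genomic regions can be sketched in Python as follows. This is only an illustration of the formats described above, not the parser used by the tool. The names GenomicRegion and parse_genomic_region are hypothetical, and the sketch assumes that a UCSC-style entry carries no strand and therefore falls back to '+'.

```python
import re
from typing import NamedTuple


class GenomicRegion(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str


# UCSC browser style, e.g. chr1:23,394,882-23,395,709
_UCSC = re.compile(r"^(\S+):([\d,]+)-([\d,]+)$")


def parse_genomic_region(line: str) -> GenomicRegion:
    """Parse one input line in either of the two formats shown above."""
    fields = line.split()  # splits on spaces and tabs alike
    if len(fields) == 1:
        match = _UCSC.match(fields[0])
        if match is None:
            raise ValueError(f"unrecognised region: {line!r}")
        chrom, start, end = match.groups()
        strand = "+"  # assumption: no strand in this format, so use '+'
    elif len(fields) in (3, 4):
        chrom, start, end = fields[:3]
        strand = fields[3] if len(fields) == 4 else "+"
    else:
        raise ValueError(f"unrecognised region: {line!r}")
    start_i = int(start.replace(",", ""))
    end_i = int(end.replace(",", ""))
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-': {line!r}")
    if start_i > end_i:
        raise ValueError(f"start must not exceed end: {line!r}")
    return GenomicRegion(chrom, start_i, end_i, strand)


print(parse_genomic_region("chr16 9856551 9863407 -"))
print(parse_genomic_region("chr1:23,394,882-23,395,709"))
```

Checking a file with such a function before uploading it can catch malformed lines early.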

A protein region is represented by three items: the Ensembl ID of the protein, and the start and end coordinates along the protein, separated by spaces or tabs. The following is a list of protein regions in the format that the online tool requires:

ENSP00000371627 123 155
ENSP00000380077 279 301
ENSP00000403892 41 127

A more detailed description of the representations of genomic regions and protein regions that this online tool accepts is given below.

Within a specific genome, a genomic region is represented by the name and the strand of the chromosome where the region is located, and the region’s start and end coordinates along the chromosome. Note that the start coordinate is always less than or equal to the end coordinate. The strand of the chromosome is denoted as “+” for the forward strand and “-” for the reverse strand. One example of a genomic region in the human genome is “chr22 20127078 20127380 +”, which can also be represented as “22 20127078 20127380 +”. Note that the chromosome names in the input data can be either in the Ensembl style, e.g. 1, 2, 3, ..., X, Y and MT, or in another popular style, namely chr1, chr2, chr3, ..., chrX, chrY and chrM. The two styles cannot be mixed in one input.
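Because the two chromosome naming styles cannot be mixed, it can be useful to check an input list before submitting it. The following Python sketch shows one way to do so. The function names are hypothetical, and telling the styles apart by the "chr" prefix is our simplification, not a description of how the tool works.

```python
from typing import Iterable, Optional


def chromosome_style(name: str) -> str:
    """Classify a chromosome name as 'chr' style or 'ensembl' style."""
    return "chr" if name.startswith("chr") else "ensembl"


def check_single_style(chromosomes: Iterable[str]) -> Optional[str]:
    """Return the one naming style used, or raise if both styles occur."""
    styles = {chromosome_style(name) for name in chromosomes}
    if len(styles) > 1:
        raise ValueError("chromosome naming styles are mixed in one input")
    return styles.pop() if styles else None


print(check_single_style(["chr22", "chrX", "chrM"]))  # chr
print(check_single_style(["22", "X", "MT"]))          # ensembl
```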

A protein region refers to a section of the protein consisting of consecutive amino acids. It is represented by the Ensembl ID of either the protein or the corresponding transcript, and the start and end coordinates along the protein. The representation therefore has two formats, and you can choose whichever is more convenient. For example, a region from the 91st amino acid to the 163rd amino acid in the protein corresponding to the transcript ZDHHC8-004 is represented as either “ENSP00000412807 91 163” or “ENST00000436518 91 163”, where ENSP00000412807 and ENST00000436518 are the Ensembl IDs of the protein and the transcript of ZDHHC8-004, respectively, in the Ensembl Homo sapiens GRCh37 V74 genome. Note that you can use either format, but you have to use the same format throughout one input file.
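The protein region format and the rule that one input must use a single ID type can be sketched in Python as follows. This is an illustration only. The names ProteinRegion and parse_protein_regions are hypothetical, and the regular expression is our assumption about Ensembl stable IDs (the prefix ENS, optional species letters, then P for protein or T for transcript, then digits).

```python
import re
from typing import List, NamedTuple


class ProteinRegion(NamedTuple):
    ensembl_id: str
    start: int
    end: int


# Assumed shape of an Ensembl protein (P) or transcript (T) stable ID.
_ENSEMBL_ID = re.compile(r"^ENS[A-Z]*([PT])\d+$")


def parse_protein_regions(text: str) -> List[ProteinRegion]:
    """Parse one region per line and require a single ID type throughout."""
    regions = []
    id_types = set()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"expected 3 items: {line!r}")
        match = _ENSEMBL_ID.match(fields[0])
        if match is None:
            raise ValueError(f"not a protein or transcript ID: {line!r}")
        id_types.add(match.group(1))
        regions.append(ProteinRegion(fields[0], int(fields[1]), int(fields[2])))
    if len(id_types) > 1:
        raise ValueError("protein and transcript IDs are mixed in one input")
    return regions


print(parse_protein_regions("ENSP00000412807 91 163"))
print(parse_protein_regions("ENST00000436518 91 163"))
```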

Output

The output of the analysis performed by the tool depends on the type of task chosen. Below are the descriptions of the output for each of the four tasks. Please note that all the DNA sequences obtained are always given from 5' to 3' on the strand of the region, whether that is the + or the - strand.

  1. Find the DNA sequences of any genomic regions:
        The results contain the list of the original genomic regions as in the input and, after them, an added column for the DNA sequence of the corresponding genomic region. An example of the output of this task for three genomic regions is:

    chromosome start end strand dnaSeq
    chr1 23394982 23395000 + GTTATGTTTAGTTTTATAA
    chr16 9856551 9856555 - AAAAT
    chrX 207170 207180 + CCAGCACATAC

  2. Find the protein sequences of any genomic regions:
        The results contain the list of the original genomic regions and, after them, five added columns (transId, dnaSeq, dnaBefore, dnaAfter and pepSeq, as named in the header of the example below). Note that if, according to the chosen gene annotations, a genomic region codes for more than one distinct protein sequence, then those protein sequences are displayed on different lines, with the same genomic region repeated on each line. It is therefore possible for the output to have more lines than the original input list.
         The following is the output of the task for three genomic regions. Note that there is no result for the 3rd genomic region (represented by five NAs), because this region does not overlap with any protein coding region in the genome.

    chromosome start end strand transId dnaSeq dnaBefore dnaAfter pepSeq
    chr1 23395000 23395050 + ENST00000356634;ENST00000400181;ENST00000542151 GTTCCTAAAGAGAAAGATG AA VPKEKDE
    chr16 9857914 9857924 - ENST00000330684;ENST00000396573;ENST00000396575;ENST00000404927;ENST00000535259;ENST00000562109 CAAGGGGGACT CG CC RKGDS
    chrX 207170 207185 + NA NA NA NA NA

  3. Find the genomic coordinates of any protein regions:
        The results contain the list of the original protein regions and, before them, six added columns for the genomic positions of the protein regions (chromosome, start, end, strand, start_exon and end_exon). The following are some examples of the output of this task for three protein regions:

    chromosome start end strand start_exon end_exon Id start end
    chr16 3166431 3166529 + Exon_4 Exon_4 ENSP00000371627 123 155
    chr16 3274174 3274242 - Exon_4 Exon_4 ENSP00000380077 279 301
    chr16 3188540 3190823 + Exon_1 Exon_3 ENSP00000403892 41 127

  4. Find the amino acid sequences of any protein regions:
        The results contain the list of the original protein regions and, after them, an added column for the amino acid sequences of the protein regions. The following are some examples of the output of this task for three protein regions:

    Id start end pepSeq
    ENSP00000371627 123 155 PVTFEDVALYLSREEWGRLDHTQQNFYRDVLQK
    ENSP00000380077 279 301 YDCNHCGKSFNHKTNLNKHERIH
    ENSP00000403892 41 55 AGPVALGDIPFYFSR
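
The example outputs above can be cross-checked with a short Python sketch, which also illustrates that all coordinates are inclusive. The helper names span and translate are ours and not part of the tool. The reading of the Task 2 columns is our interpretation of the example rows: dnaBefore and dnaAfter appear to hold the bases that complete the first and last codons, and the first example row appears to have an empty dnaBefore.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def span(start: int, end: int) -> int:
    """Number of positions in an inclusive start..end range."""
    return end - start + 1


def translate(dna: str) -> str:
    """Translate a 5'->3' coding sequence made of whole codons."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


checks = {
    # Task 1: dnaSeq has one base per position of the region.
    "task1": len("GTTATGTTTAGTTTTATAA") == span(23394982, 23395000),
    # Task 2 (assumed reading): dnaBefore + dnaSeq + dnaAfter gives pepSeq.
    "task2_row1": translate("" + "GTTCCTAAAGAGAAAGATG" + "AA") == "VPKEKDE",
    "task2_row2": translate("CG" + "CAAGGGGGACT" + "CC") == "RKGDS",
    # Task 3: within a single exon, the region covers 3 bases per amino acid.
    "task3_row1": span(3166431, 3166529) == 3 * span(123, 155),
    # Task 4: pepSeq has one letter per amino acid of the region.
    "task4_row3": len("AGPVALGDIPFYFSR") == span(41, 55),
}
print(checks)
```

The third example row of Task 3 runs from Exon_1 to Exon_3, so its genomic range (2284 positions) is longer than three bases per amino acid (261 bases); presumably the range also covers the sequence between those exons.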