A: ProtaBank is freely available for all non-commercial purposes. Anyone can browse the database, submit data, and use the search and analysis tools. However, we ask that you sign up as a user first. The only required fields are those for your username, password and email, but additional information (phone, institution, department) is also helpful and will allow for more dialogue and collaboration between users.
A: Create an account to access additional search tools, submit studies, or use the ProtaBank API. Creating an account is free! We use accounts to ensure that submitted data is associated with a particular user who can be contacted if issues arise.
A: Yes
A: ProtaBank allows all users to submit data, even if they did not generate it, to ensure that all useful protein engineering data is captured. See a more detailed description of ProtaBank submission policies here.
A: Yes. The date your data will become available to other users for viewing, searching, etc. is specified when you submit the study to the database (see the tutorial). This date can be a maximum of six months from the current date. To extend the embargo beyond this period, you will need to contact ProtaBank support.
A: ProtaBank aims to collect any data, obtained experimentally or from computations/simulations, in which the amino acid sequence of a protein was modified to alter some property. Sequences obtained from de novo designs can also be submitted.
A: ProtaBank currently focuses only on the amino acid sequence, but we hope to collect DNA and RNA sequences soon!
A: All raw data, and parameters fit to raw data, should be entered as experimental assays. Any quantities that result from manipulating the raw data or parameters (e.g., by subtraction or division) should be entered as derived quantities.
A: Data can be entered using the web interface, which supports manual input as well as upload of data in a spreadsheet format via comma-separated values (CSV) files. Select the Submission Tools icon, then click the Create a new submission icon on your Home page to get started. Hovering over an input field brings up help text to guide you. Details on data entry are also provided in the ProtaBank tutorial (click Tutorial in the ProtaBank banner at the top of every page).
For advanced users, an application programming interface (API) is also provided for batch upload of large data sets. For details, click API in the ProtaBank banner that appears at the top of every page.
A: Tabular data stored in a spreadsheet can be saved as a comma-separated values (CSV) file, so this is often a good starting point. A separate CSV file should be created for each data set. Instructions are given at the top of the web form (accessed via the Add Mutational Data link).
A new line is used for each mutant sequence, with each of the data values separated by a comma. The mutant must be specified first (in column 0), followed by its associated assay data (in columns 1, 2, 3, 4, etc.). If the first column is empty, that line is ignored (useful for comments, headers, etc.).
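As a rough illustration of this row layout, the Python sketch below reads CSV text and skips any line whose first column is empty. The function name `read_mutant_rows` is hypothetical, and this is only a sketch of the described convention, not ProtaBank's own parser.

```python
import csv
import io


def read_mutant_rows(csv_text):
    """Return (mutant, values) pairs from CSV text.

    Column 0 holds the mutant; the remaining columns hold assay data.
    Lines with an empty first column are skipped (comments, headers, etc.).
    """
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if not record or not record[0].strip():
            continue  # empty column 0: the line is ignored
        mutant = record[0].strip()
        values = [cell.strip() for cell in record[1:]]
        rows.append((mutant, values))
    return rows


# The first line has an empty column 0, so it acts as a header and is skipped.
example = ",Tm (C),Activity\nY3F,55.2,0.8\nY3F+L5I,,unfolded\n"
rows = read_mutant_rows(example)
```

Here `rows` keeps the two mutant lines only; the blank cell for `Y3F+L5I` is preserved as an empty string so that column positions still line up with the assays.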
Data (columns 1, 2, 3, 4, etc.) can be numerical, given as a range or limit (e.g., 20-30, >99), or qualitative (e.g., text such as "unfolded"). Negative numerical values must be indicated with a dash (-) directly in front of the number (e.g., -23); if a subtraction sign, em-dash, en-dash, or other character is used, the data will be treated as qualitative. E notation can be used for large numbers (e.g., 1.23E+10 for 12300000000); other forms of scientific notation are not allowed. Blanks in any column other than 0 indicate no data and are ignored, whereas ND, NA, a dash by itself, etc. are treated as qualitative data. Abbreviations should be defined in the assay details.
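The rules for data values can be sketched as a small classifier. This is a minimal Python illustration under stated assumptions, not ProtaBank's actual validation code: the function name `classify_value` and the category labels are hypothetical, ranges are assumed to join two non-negative numbers, and `<` is assumed to be accepted alongside `>` for limits.

```python
import re

# Unsigned number, optionally in E notation (e.g., 1.23E+10).
_NUM = r"(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?"
NUMBER_RE = re.compile(rf"-?{_NUM}")         # only an ASCII "-" marks a negative
RANGE_RE = re.compile(rf"({_NUM})-({_NUM})")  # assumption: non-negative bounds
LIMIT_RE = re.compile(rf"([<>])(-?{_NUM})")   # assumption: "<" allowed like ">"


def classify_value(cell):
    """Classify one data cell as no_data, number, range, limit, or qualitative."""
    text = cell.strip()
    if text == "":
        return ("no_data", None)
    if NUMBER_RE.fullmatch(text):
        return ("number", float(text))
    match = RANGE_RE.fullmatch(text)
    if match:
        return ("range", (float(match.group(1)), float(match.group(2))))
    match = LIMIT_RE.fullmatch(text)
    if match:
        return ("limit", (match.group(1), float(match.group(2))))
    # Anything else (ND, NA, a lone dash, a typographic minus) is text.
    return ("qualitative", text)
```

For example, `classify_value("-23")` yields a number, while the same value typed with a typographic minus sign (U+2212) falls through to the qualitative branch, which matches the behavior described above.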
Mutants (column 0) can be specified using the entire mutant sequence or as mutations from a specified starting sequence. Two formats are available for the latter:
WT#MUT+WT#MUT (wild-type amino acid, residue #, mutant amino acid) format, with each mutation separated by a plus sign (e.g., Y3F+L5I+I6V). See example of CSV file and spreadsheet used to create it.
Mutated Residue Range/List format, which correlates positions in the starting sequence with the amino acids given in the CSV file (e.g., FSI for residues 3-5; or FIV for residues 3,5,6). See example of CSV file and spreadsheet used to create it.
"WT" can be used to specify the starting sequence (the one being mutated, which is not always wild-type), or for the Range/List format, the starting amino acids for the mutated positions can be given.
A: A plot appears only if the two selected assays share more than one sequence. If no plot appears, check that the two assays have sequences in common.
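If you have the data for both assays locally, the overlap can be checked before plotting. The Python sketch below assumes each assay is held as a dictionary mapping a sequence to its measured value; that layout and the function names are hypothetical.

```python
def shared_sequences(assay_a, assay_b):
    """Return the sequences present in both assays, sorted for stable output."""
    return sorted(set(assay_a) & set(assay_b))


def can_plot(assay_a, assay_b):
    """A comparison plot needs more than one sequence in common."""
    return len(shared_sequences(assay_a, assay_b)) > 1


stability = {"MKFAIVG": 55.2, "MKYALIG": 61.0, "MKFSIIG": 48.7}
activity = {"MKFAIVG": 0.8, "MKYALIG": 1.0}
binding = {"MKFSIIG": 12.0}
```

Here `stability` and `activity` share two sequences, so a plot would be expected, whereas `stability` and `binding` share only one.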
A: The plot is a violin plot, which shows a smoothed distribution of the data. The white dot shows the median, the grey bar shows the middle 50% (1st quartile to 3rd quartile), and the whiskers show the minimum and maximum values excluding outliers; outliers beyond the whiskers are shown as black dots.
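The summary values drawn on such a plot can be reproduced with the Python standard library. This is only a sketch: the criterion ProtaBank uses to flag outliers is not stated here, so the code assumes the common 1.5 x IQR rule and places the whiskers at the most extreme values that are not outliers. The function name `violin_summary` is hypothetical.

```python
import statistics


def violin_summary(values):
    """Median, quartiles, whisker ends, and outliers for one distribution."""
    data = sorted(values)
    # Three cut points split the data into four equal-probability groups.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed outlier rule (1.5 x IQR); the real criterion may differ.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inliers[0],
        "whisker_high": inliers[-1],
        "outliers": outliers,
    }


summary = violin_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

For this input the median is 5, the grey bar would span 3 to 7, the whiskers would run from 1 to 8, and 100 would be drawn as an outlier.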
A: All submitted data is validated to ensure data integrity before inclusion in the database. Automated tests are performed to ensure that: (1) the data falls within the correct range of values, (2) the assigned units are appropriate for the assayed property, and (3) the amino acid listed for wild type is consistent with that specified in the starting sequence. Outliers in a data set are also flagged. The submitter is immediately warned if any errors are detected.
Currently, ProtaBank developers also review studies manually. The publication is checked to ensure that its details match the description of the data. Protein details, such as the specified PDB or UniProt ID, are verified to ensure that the correct protein has been identified. Assay descriptions are checked to ensure that each assay is unique and that the selected category, property, and units are sensible. Potential errors and suggestions are sent back to the submitter for review.