Tutorial

A. Overview

Creating a new submission

Start a new submission by selecting Submission Tools, then clicking the Create a new submission icon on your Home page. You first select whether the data to be submitted is from published or unpublished work, then enter basic information on the study (title, authors, year, abstract, etc.). The Summary/Links page is then displayed; it summarizes the information entered so far and provides links to additional entry forms. These forms are used to enter details on the protein that was mutated, the type(s) of assay data and quantities reported, and the mutant sequences and their data values.

Data entry forms must be filled out that specify:
  1. The study: This information is entered at the outset via the Create From Published Study button or the Create From Unpublished Study button. If the study is published, this data can be fetched from PubMed via the PubMed ID.
  2. The protein that was mutated: Use the Add Protein link. This information (protein name, organism, PDB or UniProt ID, sequence) can be fetched from PDB via the PDB ID or from UniProt via the UniProt accession number.
  3. Each of the quantities reported, including property measured, data source (experimental, computational, or derived), and technique: Use the Add Assay link.
  4. For each data set, the mutant sequences and their associated data values: Use the Add Mutational Data link. For de novo designed sequences: Use the Add/Edit/Remove Individual Sequence Data link.
The mutant sequences and their data values can be input in three different ways:
  1. By entering the entire sequence and associated data value individually (intended primarily for de novo designed sequences).
  2. By providing a starting sequence and describing the mutants (i.e., listing just the amino acids for the position(s) that were mutated) and associated data value individually.
  3. By providing a starting sequence and describing the mutants and data values in a CSV (comma separated values) file that is uploaded. For large data sets, providing a CSV file is recommended. If there are multiple data sets in the study, each should be specified in a separate data library.

A SAVE button is provided at the bottom of each data entry page; it saves the entries for that page and returns you to the Summary/Links page. If errors are detected or a required field is not filled out, you are warned and/or directed to the item that must be completed before the page can be saved.

Sections B and C describe how to create a CSV file, fill out each of the data entry forms, determine whether a reported quantity is experimental, derived, or computational, etc.

Note: Only entry fields marked "required" must be filled in. However, we strongly recommend completing the other fields as well.

User interface
  • Links, Action items. Text or buttons displayed in blue are links and/or action items (e.g., Add Protein links to the Protein data entry page; SAVE saves the entries on the current page and returns to the Summary/Links page). Some important buttons or links are displayed in red.
  • Edit, Delete icons. These icons are displayed at the right next to each of the summarized entries on the Summary/Links page. Clicking the Edit icon brings up the relevant entry page to make changes. Clicking the Delete icon deletes the associated entries entirely (confirmation is required).
  • View icon. The View icon is displayed at the right next to each of the mutational data library entries on the Summary/Links page. Clicking it brings up details on the library.
  • Help text. For items that are not self-explanatory, hovering over a data entry field brings up text that clarifies or gives examples. Greyed out text may also be displayed in the field as an example of what is needed or to give more detailed instructions.
  • FAQs. Useful tips can also be found under Frequently Asked Questions (FAQs). Just click FAQs in the ProtaBank banner that appears at the top of every page.
Checking entered information

Summaries of the information entered so far are displayed on the Summary/Links page so that you can check that your entries are correct. As described above, the Edit and Delete icons at the right allow you to make any needed changes or delete the associated entries entirely, and the View icon allows you to check mutant library data. The total # of Data Points for each data set/library is also given so that you can confirm that all the data was included.

Editing in progress submissions

You can leave ProtaBank at any time during the submission process and return later to finish the entries or to make changes (all the saved entries are retained). Just click the See my in progress submissions icon under Submission Tools on your Home page, then select the desired Study ID from those listed. You can also delete an in progress submission by clicking the Delete icon in the In Progress Submissions list.

Submitting the study, specifying when it will be available to the public

After completing and reviewing the entries for all the data sets, you can submit your study by clicking the Submit Study to Database button at the bottom of the Summary/Links page. You will be asked to specify the date the study will be publicly available. This date can be a maximum of six months from the current date. If you want the embargo to extend beyond this period, contact ProtaBank support.
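If you want to work out the latest release date you can request before filling in the form, the following is a minimal sketch using only the Python standard library. The function name `latest_release_date` is hypothetical, and the sketch assumes that "six months" means six calendar months, with the day clamped when the target month is shorter. ProtaBank may compute the limit differently.

```python
import calendar
from datetime import date

def latest_release_date(today: date, months: int = 6) -> date:
    """Return the date `months` calendar months after `today`.

    Assumption: the limit is counted in calendar months, and the day is
    clamped when the target month is shorter (e.g. Aug 31 -> Feb 28).
    """
    month_index = today.month - 1 + months
    year = today.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(today.day, last_day))

print(latest_release_date(date.today()))
```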

Validating the data

Automated tests are then performed to ensure data integrity, and the submitter is immediately warned if any errors are detected. ProtaBank developers also check studies manually and send potential errors back to the submitter for review. If needed, you can provide the developers with feedback or additional information. Once studies pass these validation and curation steps, they are included in the ProtaBank database and made available to other users for viewing, searching, etc.

Note: Once a study has been submitted, you can't make revisions to it.

B. Preparing your data

Before filling out the entry forms, it is useful to organize your data (i.e., decide how many data sets will be entered and what information will be included in each), and get it into the proper format (i.e., prepare a CSV file for each data set).

Identifying the data sets

Many studies describe their results in a set of tables, with each table reporting results from a different set of assays or from experiments or analyses designed to answer different questions. Frequently, data is obtained for a large set of variants, and then additional or more extensive experiments are carried out on a subset of these (e.g., the hits). Thus, it often makes sense to provide the data in each table in a separate data set or library. You should decide on the data sets you will be submitting because a separate entry form must be filled out for each.

The reported quantities

The reported quantities are protein properties that were: (1) obtained experimentally (raw data or parameters fit to raw data), (2) obtained from computational modeling/simulations, or (3) derived from other reported quantities (e.g., via subtraction or division). Typically, each quantity corresponds to one of the columns in a results table or spreadsheet and must be matched with the data in your CSV file. An entry form must be filled out for each quantity to specify the property and to provide details on the experimental techniques and conditions used, to describe the computational protocols employed, or to indicate how the quantity was derived. This information will be useful in comparing and analyzing ProtaBank data.

Preparing the CSV file(s)

Tabular data stored in a spreadsheet can be saved in CSV (comma separated values) file format for upload to ProtaBank. Unless you only have a handful of variants, a CSV file is typically preferred over entering the data manually. Create a separate CSV file for each data set.

File format: A new line is used for each mutant sequence, with each of the data values separated by a comma. Typically the mutant is specified first (in column 1), followed by its associated experimental assay/computational protocol/derived data (in columns 2, 3, 4, etc.). You may keep comments, labels, headers etc. in your CSV file in additional columns.
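The layout described above can be sketched with Python's standard `csv` module. The mutant labels (e.g., `A24G`), the two data columns, and the values are placeholders invented for illustration; the mutant syntax that ProtaBank accepts is chosen on the Mutational Data entry form and may differ from the labels used here.

```python
import csv
import io

# Hypothetical data set: mutant description in column 1, followed by one
# column per reported quantity (here two made-up quantities).
rows = [
    ("A24G", 61.5, 3.2),
    ("L45V", 58.0, 2.7),
    ("T53K", 64.1, 3.9),
]

def to_csv_text(rows):
    """Serialize rows as CSV text: one line per mutant, values comma separated."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_csv_text(rows)
print(csv_text)
```

To produce a file rather than a string, the same `csv.writer` call can be pointed at a file opened with `open(path, "w", newline="")`.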

Data can be numerical, given as a range or limit (e.g., 20–30, >99), or qualitative (e.g., text such as "unfolded" can be used to indicate that the protein was unfolded, "ND" or "NA" can be entered to indicate that a value was not determined or does not apply, etc.). This type of "negative data" provides information and is encouraged (as opposed to leaving the data field blank). If your data includes standard errors or standard deviations, these should be entered in a separate data column in the CSV file. See FAQs for details and CSV file examples.
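Before uploading, it can help to scan a data column and see which kind of value each cell holds. The sketch below is a hypothetical pre-check, not ProtaBank's own parser; the function name `classify_value` and the category labels are assumptions made for this example.

```python
import re

# Plain or scientific-notation number, e.g. 61.5, -3, 1e-5.
NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"

def classify_value(raw: str) -> str:
    """Label a CSV data cell as 'blank', 'number', 'range', 'limit', or 'text'."""
    text = raw.strip()
    if not text:
        return "blank"
    if re.fullmatch(NUMBER, text):
        return "number"
    # Range such as 20-30 or 20–30 (hyphen or en dash between two numbers).
    if re.fullmatch(rf"{NUMBER}\s*[-–]\s*{NUMBER}", text):
        return "range"
    # Limit such as >99 or <=0.5.
    if re.fullmatch(rf"[<>]=?\s*{NUMBER}", text):
        return "limit"
    # Anything else is treated as qualitative text (e.g. "unfolded", "ND").
    return "text"

for cell in ["61.5", "20–30", ">99", "unfolded", "ND", ""]:
    print(repr(cell), classify_value(cell))
```

Cells reported as `blank` are the ones worth revisiting, since an explicit "ND" or "NA" is preferred over an empty field.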

C. Filling out the data entry forms

The Study

Information on the study is requested when creating a new submission. You must select one of two buttons to specify whether the data being submitted is from a published study or from unpublished work, then fill in the details on the data entry form that comes up.

For published studies, you can enter the PubMed ID then click the Fetch Publication Details by ID button to have all the details entered automatically. Or you can fill in each of the fields by hand (Title, Authors, Journal, Year, Abstract, etc.). The Abstract should describe the major goals and results of the study.

For unpublished work, a title is required, along with the investigators who worked on the study, the laboratory or organization where the work was done, and the date when the data was collected. An abstract is recommended.

A Study ID is automatically assigned, and the Submission Date, Submitter (login name), and Version are recorded. When the study is submitted, the Version will become 1; prior to that, it reads "not submitted." Submitting the study makes it searchable in the database.

Protein Details

Use the Add Protein link to bring up the entry form describing the protein that was mutated. If you enter the UniProt ID or the PDB ID in the field at the top, you can then click the appropriate Fetch by … button to have the rest of the details entered automatically. Or you can fill in the fields by hand. The common name of the protein and/or the domain or fragment that was engineered should be entered under Protein Name. The Organism refers to the species the protein is found in (not the host it may have been expressed in). If one or more protein structures were used in the design, enter the PDB ID and optional Chain identifier for each, separated by commas (no spaces). The UniProt accession number and protein Sequence can be included if desired, but are not needed.

Note: If you use the UniProt ID to fetch the data, all the PDB IDs associated with the protein will be listed; we just want those used in your study, so the rest should be deleted. Also, the fetched sequence will be for the whole protein. If your mutations were done on a particular chain, domain, or fragment, then just the sequence for that portion should be listed.

Sequence Display. If a sequence was entered in the Sequence field, it will also be displayed below in a form that makes it easier to check for mistakes (with numbers indicating residue positions and residues color-coded by amino acid type); a grey slider bar is provided that can be moved to view different segments along the sequence.
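If you want to proofread a sequence in a similar numbered layout before pasting it into the form, the following is a minimal sketch. It does not reproduce ProtaBank's display; the function name `numbered_blocks`, the block size of 10, and the five blocks per line are assumptions, and the example sequence is an arbitrary placeholder (the 20 amino acid letters repeated).

```python
def numbered_blocks(sequence: str, first_index: int = 1,
                    block: int = 10, per_line: int = 5) -> str:
    """Format a sequence in space-separated blocks.

    Each output line starts with the residue number of its first residue.
    """
    seq = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    line_len = block * per_line
    lines = []
    for start in range(0, len(seq), line_len):
        chunk = seq[start:start + line_len]
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        lines.append(f"{first_index + start:>5} " + " ".join(blocks))
    return "\n".join(lines)

placeholder = "ACDEFGHIKLMNPQRSTVWY" * 3  # arbitrary 60-residue example
print(numbered_blocks(placeholder))
```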

Multiple proteins: If you are submitting mutation data for more than one protein in this study, you should fill out a separate entry form for each. The mutational data for each protein must also be specified in a separate data set.

Expression Details

Clicking the icon under Expression Details brings up a form for entering this type of information. Details on what's needed in each field are provided in the hover text. The only required field is DNA Sequence, which is the DNA sequence encoding the expressed protein, including tags. The DNA sequence can be entered automatically by entering the GenBank accession number in the field at the top and clicking Fetch by GenBank Acc. No. The DNA sequence will also be displayed at the bottom in a numbered, color-coded format that makes it easier to check for mistakes, similar to that described above for protein sequences.

Although useful, this information is currently not required by ProtaBank search and analysis tools, which are based solely on the protein sequence (without any expression or other tags attached).

Assays (Experimental, Computational, or Derived)

Use the Add Assay link to bring up the entry form describing the quantity being submitted, which includes the property reported, as well as the units, source, techniques, and conditions used to obtain the quantity. A separate form must be filled out for each quantity submitted. The forms differ somewhat depending on the source (experimental measurement, computational/simulated result, or derived from a previously reported assay). Fill out assay forms for experimental and computational/simulated quantities first, as this information will be required to specify any derived quantities.

For all entry forms, a Fill in Details from Previously Entered Assay field appears at the top that lists all the assays already entered for the study (if any). This can be used in conjunction with the Fetch Assay Details button to autofill the fields with information from a previously entered assay, which can be useful if the information overlaps significantly.

  1. Common assay fields. These fields are the same for all quantities. The first six are all required.
    • Name is used to identify the quantity. The column header from the associated results table is recommended, as this name will be matched to a particular column in your CSV file.
    • Category is the general protein property that was engineered or studied. Select from those listed in the drop down or, if none of these fit, choose Other and enter a new Category. Note that Activity should be chosen for all activities measured in vitro or in cell-based assays. For activities obtained in preclinical studies involving animal models (e.g., anti-tumor activity in a mouse model) or in patients, choose Pre-clinical/Clinical Properties. In selection or survival studies (e.g., phage display), sequencing abundance, counts, etc. are often an indirect measure of another property (e.g., Activity, Binding, Stability). Choose this indirect property as the Category, and then choose Count/Number, Frequency of Occurrence, Enrichment, etc. as appropriate for the specific Property (see below).
    • Property is the more specific protein property that was measured. The drop down lists those commonly reported for the Category chosen.
    • Units. The drop down lists the units relevant for the Property chosen.
    • Data comes from. Choose the appropriate button:
      • Experimental measurement: for quantities obtained experimentally (i.e., raw data or parameters fit to raw data), e.g., melting temperature (Tm), dissociation constant (Kd), Gibbs free energy of unfolding (ΔG).
      • Computational/Simulated result: for quantities obtained using molecular modeling, computer simulations, or other protein engineering/property prediction software.
      • Derived from previously reported assay: for quantities derived from other reported quantities (e.g., differences or ratios of other reported quantities, such as ΔΔG where ΔGs are also reported, kcat/Km where values for both kcat and Km are also reported).
    • Technique refers to the experimental technique. The drop down lists those commonly used for the Property chosen. Multiple techniques can be chosen for an assayed quantity if appropriate.
    • Details. Use this field to provide additional information.
    • Auto-create an assay/quantity for associated Standard Error. Check the box if you have this data and want to include it.
    • Auto-create an assay/quantity for associated Standard Deviation. Check the box if you have this data and want to include it.
  2. Source-specific assay fields. Additional fields are also displayed that depend on the data source. Although not required, this information can be helpful in comparing and analyzing data, so filling in these fields is recommended. In all cases, additional information can be included under Details.
    • Experimental: Buffers, Temperature, pH, Protein Concentration, Wavelength. These fields describe experimental conditions that are often important. Fill in those that are relevant for the technique specified. If multiple techniques are specified where these conditions differ, indicate which conditions apply to which technique in the Details.
    • Computational/Simulated: Software, Score Function/Force Field.
    • Derived:
      • Related assays. Specifies previously entered assay(s) whose data was used to derive the quantity. If none of those listed are appropriate, the quantity should probably not be classified as derived.
      • Formula Input. The formula used to derive the quantity. Can be typed in directly or entered in LaTeX or AsciiMath notation. For example, ΔΔG = ΔG(mutant) – ΔG(WT).
      • MathJax Formula. Displays the formula entered in Formula Input in equation format.
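
To make the idea of a derived quantity concrete, the sketch below computes the two examples mentioned above (ΔΔG from ΔG values, and kcat/Km from kcat and Km). The function names and all numbers are placeholders invented for illustration, not values from any study and not part of ProtaBank.

```python
def ddg(dg_mutant: float, dg_wt: float) -> float:
    """ΔΔG = ΔG(mutant) - ΔG(WT), in the same units as the inputs."""
    return dg_mutant - dg_wt

def catalytic_efficiency(kcat: float, km: float) -> float:
    """kcat/Km; units follow from the inputs (e.g. s^-1 and M give M^-1 s^-1)."""
    if km == 0:
        raise ValueError("Km must be non-zero")
    return kcat / km

# Placeholder values for illustration only.
dg_wt = 4.0
dg_by_mutant = {"A24G": 3.0, "L45V": 5.5}
ddg_by_mutant = {name: ddg(dg, dg_wt) for name, dg in dg_by_mutant.items()}
print(ddg_by_mutant)
```

In this sketch the ΔG assay would be listed under Related assays, and the formula could be typed in LaTeX as `\Delta\Delta G = \Delta G_{\mathrm{mutant}} - \Delta G_{\mathrm{WT}}`.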
Mutational Data and Individual Sequence Data

A separate entry form must be filled out for each data set. Links to two different entry forms are provided. The Add/Edit/Remove Individual Sequence Data link is intended primarily for de novo designed sequences; the entire sequence is listed for each individual sequence (referred to as individual sequence data). The Add Mutational Data link is for mutants obtained by mutating a given protein (referred to as mutational data). In this case, you can just list the amino acids for the position(s) that were mutated; the entire sequence for each of the mutants can be entered instead if desired.

Generating a CSV file template. ProtaBank can automatically generate a comma separated values (CSV) file template from the assay information that you have entered (the Name of each assay heads a column). Click the here link in the Mutational Data section on the Summary/Links page to download the template.
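
As a rough picture of what such a template contains, the sketch below builds a header line from a list of assay names and then reads the column numbers back. The exact layout of the file ProtaBank generates is not specified here, so the leading `Mutant` column label, the assay names, and both function names are assumptions.

```python
import csv
import io

def template_header(assay_names, mutant_label="Mutant"):
    """Build a one-line CSV header: a mutant column, then one column per assay name."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow([mutant_label, *assay_names])
    return buffer.getvalue()

def column_map(header_line):
    """Map 1-based column numbers to the names found in a header line."""
    names = next(csv.reader([header_line.strip()]))
    return {number: name for number, name in enumerate(names, start=1)}

header = template_header(["Tm", "Tm SE", "dG"])  # hypothetical assay names
print(header, end="")
print(column_map(header))
```

A mapping like this is a quick way to confirm which column holds which assay before matching columns to assays on the upload form.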

  1. Add Mutational Data link. Entries are required for all fields on this entry form; details on what's needed in each field are provided in the hover text.

    Fields include: what's in the data set (Description); the name of the Protein mutated; the sequence that was mutated (Starting Sequence, not always WT); the Syntax used to specify the mutant sequences; if the Range or List syntax is used, the positions mutated (Mutated Residues); and the Index of the Initial Residue.

    Additional fields are provided that allow you to specify the actual mutant sequences and their associated data values. These fields vary depending on whether you are uploading your data from a CSV file (recommended) or entering it manually. Click the Upload a CSV file button or the Manually input mutant data button to display these fields.

    • Upload a CSV file: Recommended in general, particularly for large data sets. Tabular data stored in a spreadsheet can easily be saved in CSV file format. You should prepare your CSV file(s) before filling out this form. Instructions are provided at the top of the form and in Section B above. If desired, you can use a CSV file template (see Generating a CSV file template above). See FAQs for additional details and CSV file examples.

      The CSV file is uploaded using the Choose File button. The fields at the bottom (Column and Assay/Derived Quantity/Protocol) are then used to match the columns in the CSV file to the appropriate quantity (Name of the experimental assay, derived quantity, or computational protocol) entered previously. Click on add another as needed to specify the data in each column. If a CSV template file was used without modification (assay names are in same columns), you can use the Fetch Template Description button to populate these fields automatically.

    • Manually input mutant data: Useful for very small data sets. Each of the mutants with their associated data values and quantities are entered manually. This is done via the Mutant, Data, and Assay/Quantity/Protocol fields.

    Checking for errors. Clicking SAVE at the bottom of the input form brings up a Library Details window that summarizes the data entered so you can check for errors. This includes a library data table for the first 100 variants, which lists the mutations (Mutant Description), data value (Data), Units, name of the assay (Assay), and Full Sequence for each variant. If errors are detected, you can return to the input form by clicking on the continue editing link at the top of the window. If no errors are noted, click the Finished, Return to Study Page link in the top right corner.

  2. Add/Edit/Remove Individual Sequence Data link. Intended primarily for de novo designed sequences (as opposed to a set of mutants obtained by mutating a given protein). For each individual sequence, the entire sequence is entered manually along with its associated data values. This is done using the Sequence, Data, and Assay/Quantity/Protocol fields.
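
The relationship between a mutant description, the Starting Sequence, and the Index of the Initial Residue can be illustrated with a short sketch. It uses the common point-mutation notation of wild-type residue, position, new residue (e.g., `A23G`), which is an assumption for this example and not necessarily one of the syntaxes offered on the entry form; the function name and the sequence are also placeholders.

```python
import re

def apply_mutations(starting_sequence: str, mutations: list[str],
                    initial_index: int = 1) -> str:
    """Return the full mutant sequence for point mutations such as 'A23G'.

    `initial_index` is the residue number assigned to the first letter of
    `starting_sequence`. Raises ValueError if a mutation does not match
    the starting sequence.
    """
    residues = list(starting_sequence)
    for mutation in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
        if match is None:
            raise ValueError(f"Unrecognized mutation: {mutation!r}")
        old, position, new = match.group(1), int(match.group(2)), match.group(3)
        offset = position - initial_index  # convert residue number to list index
        if not 0 <= offset < len(residues):
            raise ValueError(f"Position {position} is outside the sequence")
        if residues[offset] != old:
            raise ValueError(
                f"Expected {old} at position {position}, found {residues[offset]}"
            )
        residues[offset] = new
    return "".join(residues)

# Placeholder fragment whose first residue is numbered 20.
print(apply_mutations("MKTAYIAK", ["A23G", "K27R"], initial_index=20))
```

A mismatch error from a check like this usually means the Index of the Initial Residue or the Starting Sequence does not agree with the numbering used in the mutant descriptions, which is the same kind of problem the Full Sequence column in the Library Details window helps you spot.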