Submit High-throughput Sequencing Reads to NCBI Sequence Read Archive (SRA)¶
Goal¶
This workflow enables CyVerse users to make submissions to the NCBI Sequence Read Archive (SRA). A submission includes compressed sequence files (FASTQ.gz, SFF.gz, and BAM.gz) and an XML metadata file, organized into a package. If you need to submit an alternative file format (HD5, SOLiD, or SRF), please email support@cyverse.org.
Create and organize an SRA submission package¶
Important
A Reminder on File Names and other SRA Requirements
NCBI has extensive requirements for depositing data into the SRA. The CyVerse submission pipeline is an attempt to streamline this process, but your submission must meet several requirements. File names must be unique, and may not contain special characters (e.g. { } ? * . , etc.) or spaces.
The SRA Submission Quick Start is the authoritative guide to SRA requirements. It is worth reading through this before submission.
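Before uploading, you may find it useful to screen your file names on your own computer. The following is a minimal sketch, not an official NCBI or CyVerse check. It assumes that only letters, digits, underscores, and hyphens are safe in the base name, that dots appear only in the extension, and that the extensions listed in `ALLOWED_SUFFIXES` are the ones you use. The function name `check_file_names` is hypothetical.

```python
import re
from collections import Counter

# Assumption: letters, digits, underscores, and hyphens are safe in the base
# name; dots are allowed only as part of the extension (e.g. ".fastq.gz").
SAFE_STEM = re.compile(r"^[A-Za-z0-9_-]+$")
ALLOWED_SUFFIXES = (".fastq.gz", ".sff.gz", ".bam.gz",
                    ".fastq.bz2", ".sff.bz2", ".bam.bz2")


def check_file_names(names):
    """Return (name, problem) tuples for names that may be rejected."""
    problems = []
    counts = Counter(names)
    for name, count in counts.items():
        if count > 1:
            problems.append((name, "duplicate file name"))
        lowered = name.lower()
        suffix = next((s for s in ALLOWED_SUFFIXES if lowered.endswith(s)), None)
        if suffix is None:
            problems.append((name, "unexpected extension"))
            continue
        stem = name[: -len(suffix)]
        if not SAFE_STEM.match(stem):
            problems.append((name, "space or special character in name"))
    return problems
```

An empty list means no problems were found under these assumptions; the SRA Submission Quick Start remains the authoritative reference.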
Important
This quickstart assumes you have uploaded your files to the CyVerse Data Store. If not, follow the directions for uploading files to the Data Store. If possible, you may wish to compress these files using gzip/bzip2 before upload.
(Optional) Compress files in the CyVerse Data Store using gzip¶
For submission through CyVerse, the sequence files must be compressed. If your files are already compressed (.gz/.bz2) you may skip this step.
- Log in to the CyVerse Discovery Environment
- Click this link to open the Compress files with gzip App or click the Apps button and search for the “Compress files with gzip” App.
- Under “Inputs” select the individual file (FastQ/SFF/BAM) to compress
- Click ‘Launch Analysis’ to compress the file and click the Analysis button to monitor job status and view results. Once all files are compressed you may wish to gather them into a single folder to begin your submission.
Tip
Each file (e.g. read_R1.fastq) must be individually compressed (e.g. read_R1.fastq.gz)
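If you would rather compress files on your own computer before upload, the same one-file-per-archive rule applies. The sketch below uses only the Python standard library; the function name `gzip_each` and the default `*.fastq` pattern are illustrative assumptions, not part of the CyVerse App.

```python
import gzip
import shutil
from pathlib import Path


def gzip_each(folder, pattern="*.fastq"):
    """Compress every file matching `pattern` in `folder` into its own .gz file.

    The original files are left in place; each output is named <input>.gz.
    """
    compressed = []
    for src in sorted(Path(folder).glob(pattern)):
        dst = src.with_name(src.name + ".gz")
        # Stream the data so large sequence files are not loaded into memory.
        with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        compressed.append(dst)
    return compressed
```

For example, `gzip_each("reads")` would turn `reads/read_R1.fastq` into `reads/read_R1.fastq.gz` alongside the original.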
Create SRA submission package and add sequence data¶
An SRA submission requires that your sequencing data are organized in a specific structure of folders and subfolders. It will be helpful if you are familiar with SRA terminology:
Tip
SRA Terminology
Adapted from NCBI:
- BioProject: A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium.
- BioSample: The BioSample database contains descriptions of biological source materials used in experimental assays. Different experimental conditions, tissue types, etc. would typically be different BioSamples.
Your submission will contain the following folders and data:
Folder | Description | Contents | Metadata |
---|---|---|---|
BioProject (One per submission) | This is the top-level folder and will contain all other folders and data. | One or more BioSample folders | A BioProject metadata template will specify details like the project summary, submitter/investigator information, etc. |
BioSample (One or more per submission) | BioSample folders, one for each experiment/tissue type, etc. | One or more library folders | A species/tissue-specific metadata template will describe biological information about the sample (e.g. sex, collection location, etc.). |
Library (One or more per submission) | These folders will contain the sequencing data. Each replicate will have its own library folder. | Each library folder will contain either one (single-end) or two (paired-end) sequence files. | A library metadata template will describe information about the sequencing run. |
Log in to the CyVerse Discovery Environment.
Click on Data to open a Data window. In your home directory (or any directory you own), create the submission folder: click on the “File” menu, then click ‘Create’ and select ‘Create NCBI SRA Submission Folder’.
Enter the name of your BioProject and the number of BioSamples and libraries. Click ‘OK’ to create the submission folders.
Tip
In our example data, we have two experimental conditions with 3 sequencing replicates each, so there are 2 BioSamples and 3 libraries in each BioSample.
Place your sequence files in the appropriate library folders.
Examine the example submission (BioProject_SRA_QuickStart folder) for reference and ensure your sequencing samples are appropriately organized.
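If you keep a local copy of your submission folder, a short script can confirm that it follows the BioProject / BioSample / library layout described above. This is a sketch under stated assumptions: it treats every subfolder of the BioProject folder as a BioSample and every subfolder of a BioSample as a library, and it does not know the folder names the Discovery Environment generates. The function name `check_submission_layout` is hypothetical.

```python
from pathlib import Path


def check_submission_layout(bioproject_dir):
    """Report library folders that do not hold one or two sequence files."""
    problems = []
    root = Path(bioproject_dir)
    # Assumption: each subfolder of the BioProject folder is a BioSample.
    for biosample in sorted(p for p in root.iterdir() if p.is_dir()):
        libraries = sorted(p for p in biosample.iterdir() if p.is_dir())
        if not libraries:
            problems.append(f"{biosample.name}: no library folders")
        for library in libraries:
            files = [f for f in library.iterdir() if f.is_file()]
            # One file for single-end data, two for paired-end data.
            if len(files) not in (1, 2):
                problems.append(
                    f"{biosample.name}/{library.name}: "
                    f"expected 1 or 2 sequence files, found {len(files)}"
                )
    return problems
```

For the example data (2 BioSamples with 3 libraries each), a correct layout has 6 library folders and the function returns an empty list.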
Fix or improve this documentation
- On Github: Repo link
- Send feedback: Tutorials@CyVerse.org
Record metadata and associate with SRA submission¶
In this section, you will complete 3 metadata templates and associate these metadata with your submission.
- BioProject: Metadata describing the overall project
- BioSample: Metadata describing the source materials sampled
- Library: Metadata describing the sequencing runs
Warning
The metadata templates are defined by NCBI. Each field must be completed exactly as described. Any typos or invalid entries will cause your submission to be rejected.
I. Complete BioProject Metadata Template¶
We will complete the BioProject Creation template to begin a new submission. You can download a completed example template here
Log in to the CyVerse Discovery Environment
Click on Data to open a Data Window, and from the “Metadata” menu select ‘Download Template.’ Select the ‘NCBI BioProject Creation’ and click ‘OK’ to download.
Tip
The download will contain two compressed (.zip) CSV files:
- blank.csv: This is the template to complete
- guide.csv: These are the NCBI-specified instructions for the template
On your local computer, you can edit the blank.csv template using any spreadsheet editor, such as Excel.
Complete the metadata template according to the instructions in the guide; save the metadata file as a .csv file (for Excel we recommend saving as ‘CSV UTF-8 (Comma delimited)’). You may name this file whatever you wish, but we suggest a name other than blank to avoid confusion with other templates.
Warning
Excel and other spreadsheet editors may silently reformat entries (such as dates) and overwrite what you typed. Excel is notorious for this in bioinformatics. Double check that dates and other entries in your template have not been altered. When editing, you may use the Formatting feature (Format menu > Cells) to ensure your sheet is entirely formatted as text.
Tip
Tips for completing metadata templates
Completing the file name or path field
The first column of CyVerse metadata templates is the ‘file name or path’ field. What should be entered is a path to the folder the metadata should be applied to. You can get this path from the Discovery Environment Data Window’s “Viewing:” field.
When copying your path ensure the name of your top-level BioProject folder is included
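When a submission has many folders, it can help to generate the ‘file name or path’ values rather than copy them one at a time. The sketch below assumes you know the full path of your top-level BioProject folder and the names of the folders beneath it; the folder names, the `/iplant/home/username` prefix in the usage note, and the function name `folder_paths` are all illustrative.

```python
import posixpath


def folder_paths(bioproject_path, biosamples):
    """Build Data Store paths for BioSample and library folders.

    `biosamples` maps each BioSample folder name to a list of its
    library folder names.
    """
    sample_paths = []
    library_paths = []
    for sample, libraries in biosamples.items():
        # Every path starts with the top-level BioProject folder.
        sample_path = posixpath.join(bioproject_path, sample)
        sample_paths.append(sample_path)
        library_paths.extend(posixpath.join(sample_path, lib) for lib in libraries)
    return sample_paths, library_paths
```

For example, `folder_paths("/iplant/home/username/MyProject", {"BioSample1": ["Lib1", "Lib2"]})` returns the BioSample path `/iplant/home/username/MyProject/BioSample1` and the two library paths beneath it, ready to paste into the first column of the matching template.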
Following Guide Instructions
The guide gives additional information about the template, including whether a template field is required and whether specific values (taken from a controlled vocabulary) must be used. Some things to remember:
- If an item’s ‘required’ value is TRUE you must provide a value, or enter one of the following null values: ‘not collected’, ‘not applicable’, or ‘missing’.
- When prompted for an email address enter the address associated with your NCBI account. Notifications will only be sent to this address.
- The value type field indicates if the response is a single line of alphanumeric characters (string), a multiline response, or an enumerated value (Enum). If a field must be an Enumerated Value (Enum) only use one of the terms specified in the guide
- Dates must be entered in the order specified by NCBI (e.g. Year-Month-Day)
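The rules in this list can be checked on your own computer before you upload a completed template. The following is a rough sketch, not an NCBI validator. It assumes dates are expected as YYYY-MM-DD, and you must supply the required and date field names yourself from the guide; the column names `organism` and `collection_date` in the usage note are hypothetical.

```python
import csv
import io
import re

# Null values the guide accepts in place of a required value.
NULL_VALUES = {"not collected", "not applicable", "missing"}
# Assumption: Year-Month-Day means four-digit year, two-digit month and day.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_template(csv_text, required_fields, date_fields=()):
    """Return (row_number, field, problem) tuples for a completed template."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for field in required_fields:
            if not (row.get(field) or "").strip():
                problems.append((row_number, field, "required value is empty"))
        for field in date_fields:
            value = (row.get(field) or "").strip()
            if value and value not in NULL_VALUES and not ISO_DATE.match(value):
                problems.append((row_number, field, "date is not Year-Month-Day"))
    return problems
```

Calling `check_template(open("my_biosample.csv", encoding="utf-8-sig").read(), ["organism"], ["collection_date"])` would flag an entry such as `3/4/2019`, which is a typical sign that a spreadsheet editor reformatted a date.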
Upload the completed template; from a Data window “Upload” menu choose ‘Simple Upload from Desktop’. You may upload to the same directory as your top-level BioProject folder, but do not place metadata files in your submission folders. (You may need to click “Refresh” to see the uploaded file)
Associate the metadata with your BioProject; in a Data Window, select your BioProject top-level folder. From the “Metadata” menu select ‘Apply Bulk Metadata’ and then ‘Select Metadata File’; browse to and select the uploaded metadata file and click ‘OK’. You should get a notification that the metadata application was successful.
In a Data Window, select the BioProject folder and in the “Metadata” menu click ‘Edit/View Metadata’ to verify the metadata is applied and accurate.
II. Complete BioSample Metadata Template¶
You will next need to select the appropriate BioSample template (organism/sample specific) and apply this to all of your BioSample folders. Most of the information may be the same for each BioSample, with differences including things like treatments and/or tissue sources. You can view a completed example template here
If necessary, log in to the CyVerse Discovery Environment
Click on Data to open a Data Window, and from the “Metadata” menu select ‘Download Template.’ Select and download the “NCBI BioSample” template appropriate for your submission. If you are unsure about which template to select, post a question to the CyVerse User Forum.
Complete the metadata template (See the warnings and tips in the BioProject Instructions above).
Important
You must complete a row of metadata for every BioSample folder. The metadata for all your BioSamples can remain in the same file, assuming that template is appropriate for all the BioSamples in your project. If you require more than one BioSample template, you will need to complete a separate template for each relevant BioSample.
Upload the completed template; from a Data window “Upload” menu choose ‘Simple Upload from Desktop’. You may upload to the same directory as your top-level BioProject folder, but do not place metadata files in your submission folders. (You may need to click “Refresh” to see the uploaded file)
Associate the metadata with your BioSample; in a Data Window, select your BioProject top-level folder. From the “Metadata” menu select ‘Apply Bulk Metadata’ and then ‘Select Metadata File’; browse to and select the uploaded metadata file and click ‘OK’. You should get a notification that the metadata application was successful.
Tip
Although you select your BioProject folder, since your metadata template specifically indicates the path to your BioSample folders, metadata will be applied to those subdirectories.
In a Data Window, select a BioSample folder and in the “Metadata” menu click ‘Edit/View Metadata’ to verify the metadata is applied and accurate. Verify the metadata for each of your BioSamples.
III. Complete Library Metadata Template¶
This final template will need to be completed for every BioSampleLibrary folder. You can view a completed example template here
If necessary, log in to the CyVerse Discovery Environment
Click on Data to open a Data Window, and from the “Metadata” menu select ‘Download Template.’ Select and download the “NCBI SRA Library” template.
Complete the metadata template (See the warnings and tips in the BioProject Instructions above).
Important
You must complete a row of metadata for every BioSampleLibrary folder. The metadata for all your libraries can remain in the same file.
Upload the completed template; from a Data window “Upload” menu choose ‘Simple Upload from Desktop’. You may upload to the same directory as your top-level BioProject folder, but do not place metadata files in your submission folders. (You may need to click “Refresh” to see the uploaded file)
Associate the metadata with your libraries; in a Data Window, select your BioProject top-level folder. From the “Metadata” menu select ‘Apply Bulk Metadata’ and then ‘Select Metadata File’; browse to and select the uploaded metadata file and click ‘OK’. You should get a notification that the metadata application was successful.
Tip
Although you select your BioProject folder, since your metadata template specifically indicates the path to your library folders, metadata will be applied to those subdirectories.
In a Data Window, select a BioSampleLibrary folder and in the “Metadata” menu click ‘Edit/View Metadata’ to verify the metadata is applied and accurate. Verify the metadata for each of your BioSampleLibrary folders.
Warning
Once you have finished adding metadata to your submission folders, you cannot move or rename those folders without going back to edit the metadata entries.
IV. Generate summary metadata file¶
We will now generate a file that captures the metadata for the entire submission. In the next step, we will validate our results. You can view an example of this file here
- If necessary, log in to the CyVerse Discovery Environment
- Click on Data to open a Data Window, and select your top-level BioProject folder. From the “Metadata” menu, select ‘Save Metadata to file’; save the file with a descriptive name and a .xml ending (this may take a few minutes to generate; you may need to click “Refresh” to see the file).
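If you download the saved .xml file, a quick local check can confirm that it is well-formed and give a rough picture of its contents before you run the validation App. This sketch makes no assumption about the element names CyVerse writes; it simply counts whatever tags are present. The function name `summarize_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def summarize_xml(xml_text):
    """Parse XML text and count elements by tag.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed.
    """
    root = ET.fromstring(xml_text)
    counts = {}
    for element in root.iter():
        counts[element.tag] = counts.get(element.tag, 0) + 1
    return root.tag, counts
```

A parse error here suggests the file was truncated or edited by hand. A clean parse only shows the file is well-formed XML, not that NCBI will accept its contents.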
Validate and submit package to SRA¶
I. Validate Submission¶
In this section, you will verify that metadata has been appropriately associated with your submission package and complete the submission process.
Log in to the CyVerse Discovery Environment
Click the link to open the NCBI SRA Submission - BioProject Creation App, or in the Discovery Environment, click Apps to open the Apps menu and search for the “NCBI SRA Submission - BioProject Creation” App.
If desired, enter an analysis name or comments.
Under “Inputs” check ‘Validate metadata file only?’.
Under ‘Select BioProject Folder’ browse to and select the top-level BioProject folder. Under ‘Select BioProject metadata file’ browse to and select the previously generated metadata file (.xml).
Click ‘Launch Analysis’ to begin the validation, and click on Analyses to monitor the job progress. When status is ‘Completed’, click on the job name to view results. A successful validation will generate two folders:
- A folder of logs
- A folder with your username and a long alphanumeric string. This folder will contain the submission.xml metadata file associated with your submission.
Tip
Although a job returns with the status ‘Completed’, that does not mean that the submission is error-free. In the submission process, NCBI will review the submitted files and metadata and may discover errors.
For failed validation/submissions
For either validation or submission, if the app fails and no submission.xml file is created, there are one or more errors in the submission package. See the Analysis log files (especially condor-stderr-0) for information to assist with error correction.
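Log files can be long, so it may help to pull out only the lines that look like problems. The sketch below is a plain keyword search over the text of a downloaded log such as condor-stderr-0. The keyword list is a guess, since the exact wording of the App's messages is not documented here, and the function name `find_error_lines` is hypothetical.

```python
def find_error_lines(log_text, keywords=("error", "exception", "invalid")):
    """Return (line_number, line) pairs for lines containing any keyword.

    Matching is case-insensitive; the default keywords are assumptions.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line.strip()))
    return hits
```

For example, `find_error_lines(open("condor-stderr-0").read())` lists candidate lines to read first; an empty result does not guarantee the log is free of problems.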
II. Send submission package to SRA¶
In this step, we use the application above, but the option to ‘Validate metadata file only’ is left unchecked.
If necessary, log in to the CyVerse Discovery Environment
Click the link to open the NCBI SRA Submission - BioProject Creation App, or in the Discovery Environment, click Apps to open the Apps menu and search for the “NCBI SRA Submission - BioProject Creation” App.
If desired, enter an analysis name or comments.
Under ‘Select BioProject Folder’ browse to and select the top-level BioProject folder. Under ‘Select BioProject metadata file’ browse to and select the previously generated metadata file (.xml).
Click ‘Launch Analysis’ to begin the submission, and click on Analyses to monitor the job progress. When status is ‘Completed’, click on the job name to view results. A successful submission will generate two folders:
- A folder of logs. You should find a ‘manifest.txt’ file documenting the files transferred to the SRA
- A folder with your username and a long alphanumeric string. This folder will contain the .xml metadata file associated with your submission.
Confirm submission to SRA and fix errors¶
CyVerse systems connect to SRA systems and create the submission folder on the SRA side. Files are transferred and a ‘submit.ready’ file is sent to the SRA to signal that the submission package is complete and they can begin processing. The SRA system validates the submission package and generates a report.xml file containing any errors detected. The SRA system sends notification email(s) to the contact email provided in the BioProject metadata template, and to the CyVerse team to notify of either a successful or failed submission.
If your SRA submission is successful¶
You will receive an email (“Submission ownership transfer”) at the email address provided in the package metadata (also associated with your NCBI account). After ownership transfer, you can view the submission progress at https://submit.ncbi.nlm.nih.gov/subs/. You may need to log in with the NCBI credentials for the account you used in the submission metadata.
If your SRA submission contains errors¶
You will receive an email at the email address provided in the package metadata (also associated with your NCBI account) informing you about the error. You can retrieve the submission report.xml file from SRA servers with the ‘NCBI SRA Submission Report Retrieval’ App in the DE, make corrections, and resubmit.
Log in to the CyVerse Discovery Environment
Click the link to open the NCBI SRA Submission - Report Retrieval App, or in the Discovery Environment, click Apps to open the Apps menu and search for the “NCBI SRA Submission - Report Retrieval” App.
If desired, enter an analysis name or comments.
For “Inputs” under ‘Select NCBI SRA Submission App Output Folder’ browse to and select the output folder previously generated in the submission (this should have your username followed by an alphanumeric string).
Click ‘Launch Analysis’ to begin the report retrieval, and click on Analyses to monitor the job progress. When status is ‘Completed’, click on the job name to view results. The App will generate two folders:
- A folder of logs.
- A folder with your username and a long alphanumeric string. This folder will contain a report detailing the errors detected. You should correct these errors (see tip below) and resave the metadata file (see IV. Generate summary metadata file).
Tip
Fixing errors in metadata entries
To fix metadata entries, it is not necessary to download/edit/upload templates. For a folder you wish to correct, select the folder in a Data window and from the “Metadata” menu select ‘Edit/View Metadata’; select any entry you wish to edit and then click the ‘Edit’ button. Save your edit and then click ‘Save’ again to save the metadata file.
Other tips
- Remember to save a new metadata file from the top level of the submission package before resubmitting. It is best practice to name this file differently from the previous metadata file.
- During error correction, only make changes to SRA-detected errors. All other changes will be ignored by the SRA during resubmission. If additional changes are required, they can be made using the NCBI website after successful submission.
- If no report.xml is retrieved, this does not necessarily mean your submission failed. The SRA system may not have generated it yet. Make sure to wait for notification from the SRA that the submission has been received and processed.
Next Steps:
Once you have verified your files are available to the SRA, you can consider deleting these files from the CyVerse Data Store.
Additional information, help¶
Search for an answer: CyVerse Learning Center or CyVerse Wiki
Post your question to the user forum: Ask CyVerse
Prerequisites¶
Note
To complete this tutorial, you must upload your FASTQ/SFF/BAM files to the CyVerse Data Store. See the Data Store Guide for instructions on how to upload your files (for example, using Cyberduck). You will also need detailed metadata about the sample being submitted (e.g. collection/accession information, cell line/tissue metadata, etc.) and the sequencing platform used (e.g. library preparation strategy, sequencing instrument, etc.). These requirements vary with the organism sequenced and are discussed in detail in the metadata section of this quickstart.
Downloads, access, and services¶
In order to complete this tutorial you will need access to the following services/software
Note
Register for an NCBI Account
If you do not have an NCBI account (you can check for an existing account by logging in at https://www.ncbi.nlm.nih.gov/account/), register at https://www.ncbi.nlm.nih.gov/account/register/.
Platform(s)¶
We will use the following CyVerse platform(s):
Platform | Interface | Link | Platform Documentation | Quick Start |
---|---|---|---|---|
Data Store | GUI/Command line | Data Store | Data Store Manual | Guide |
Discovery Environment | Web/Point-and-click | Discovery Environment | DE Manual | Guide |
Input and example data¶
In order to complete this quickstart you will need to have the following inputs prepared
Input File(s) | Format | Preparation/Notes | Example Data |
---|---|---|---|
FastQ/SFF/BAM files | FastQ/SFF/BAM compressed (.gz/.bz2) | For additional details see the SRA File Format Guide | Truncated sample FastQ files |