
Introduction

The purpose of this guide is to give a consolidated and authoritative overview of the OSD 2014 data.

After reading it, you should

have an overview of which OSD 2014 data is available where
understand how the data was generated
be in a position to use and interpret the data correctly

Overview

In the days around the June 2014 solstice, more than 150 teams of scientists collected samples around the world.

The result was:

150 metagenome samples from around 150 sites
155 samples collected by protocol NPL022 and 16S amplicon sequenced by LGC (lgc)
155 samples collected by protocol NPL022 and 18S amplicon sequenced by LGC (lgc)
7 samples collected by protocol NPL022 and 16S amplicon sequenced by Ramaciotti in Australia (ramaciotti)
30 samples collected by protocol NE08 and 18S V4 region amplicon sequenced by LifeWatch Italy (lw); see protocol for details
32 samples collected by protocol NE08 and 18S V9 region amplicon sequenced by LifeWatch Italy (lw); see protocol for details

Details about protocols NPL022 and NE08 are in the OSD Handbook.

Material and Methods for “OSD 2014 - from sampling to sequencing” can be downloaded here.

Please note that several sequencing centers were involved:

LGC Genomics (shorthand: lgc), our main sequencing center, which sequenced the metagenomes and the 16S/18S amplicons from protocol NPL022
LifeWatch Italy (shorthand: lw), which kindly provided sequencing for the protocol NE08 samples; see protocol for details
Australia - Ramaciotti (shorthand: ramaciotti), which for legal reasons had to sequence the samples from all 7 Australian sites

Initial sequence data pre-processing

The sequence data as delivered by the sequencing centers was pre-processed to derive common data sets on which to base follow-up analyses. Please see the wiki page on pre-processing for details.

In summary, the pre-processing yields two kinds of quality-controlled sequence data sets, raw and workable, for each input sequence set:

For amplicon data the output files per sample are:
- raw: non-merged
- workable: merged
For shotgun data the output files per sample are:
- raw: non-merged (used e.g. by EMG)
- workable output files
  - merged (used e.g. by mg-traits)
  - non-merged (used e.g. for assemblies)

Data deposited in public archives and available on web sites

Environmental Data

Measured by OSD Site Coordinators: OSD 2014 Environmental Metadata
- CSV with | (pipe) as delimiter and UTF-8 encoded
- take care to adjust the import settings accordingly when loading this file into e.g. Excel
- Detailed documentation of file structure and content
Calculated environmental data based on data from public environmental databases: OSD 2014 Environmental Ancillary Data
- Including data based on Halpern et al. (as of 2015-12-05)
- Documentation can be found as readme sheets in the file
- This data was kindly generated by Dr. Shruti Malaviya, by crawling related public datasets
Scanned copies of log sheets are available at PANGAEA
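
For readers who load the environmental metadata outside a spreadsheet, here is a minimal sketch of reading a pipe-delimited, UTF-8 encoded file with Python's standard csv module. The column name water_temperature and the row values are invented for illustration, and the in-memory text stands in for the real download.

```python
import csv
import io


def read_pipe_csv(handle):
    """Return the rows of a pipe-delimited CSV as a list of dicts."""
    return list(csv.DictReader(handle, delimiter="|"))


# Tiny in-memory stand-in for the real file (hypothetical columns and values).
sample = io.StringIO(
    "osd_label|water_temperature\n"
    "OSD3_2014-06-20_0m_NPL022|17.5\n"
)
rows = read_pipe_csv(sample)
```

For a file on disk, the same function could be called with a handle opened as open(path, encoding="utf-8", newline=""), so that the UTF-8 encoding is set explicitly rather than left to the platform default.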

Sequence and other OSD project data

ENA archived data
- http://www.ebi.ac.uk/ena/data/view/PRJEB8682
- as of 2014-04-30: all metagenome, 16S and 18S raw data from OSD protocol NPL022
- LifeWatch 18S available (see below), archiving at ENA pending
All workable data is available here
All raw data is available here
Metagenome analysis by EMG based on raw data
- https://www.ebi.ac.uk/metagenomics/projects/ERP009703
Metagenome analysis by MG-Traits based on workable data
- http://mb3is.megx.net/mg-traits/samples
16S/18S analysis by SILVAngs
- see details below

Mapping between OSD metadata and ENA RUNs

The OSD 2014 Environmental Metadata includes a column named osd_label.

Here you can find a file which maps these osd_labels to the respective ENA RUN identifiers.
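
How such a mapping might be used can be sketched as follows. This assumes the mapping file is pipe-delimited like the metadata file and has a run column called ena_run; both are assumptions, so check the actual file before relying on them. The second data row is a made-up placeholder.

```python
import csv
import io


def load_label_to_runs(handle, label_col="osd_label", run_col="ena_run"):
    """Build a dict from osd_label to the list of ENA RUN identifiers.

    run_col is a hypothetical column name; one label may map to several
    runs (e.g. 16S, 18S and metagenome), hence the list values.
    """
    mapping = {}
    for row in csv.DictReader(handle, delimiter="|"):
        mapping.setdefault(row[label_col], []).append(row[run_col])
    return mapping


# In-memory stand-in for the mapping file (layout assumed, second row invented).
mapping_file = io.StringIO(
    "osd_label|ena_run\n"
    "OSD3_2014-06-20_0m_NPL022|ERR867761\n"
    "OSD_EXAMPLE_LABEL|ERR0000000\n"
)
label_to_runs = load_label_to_runs(mapping_file)
```

Looking up label_to_runs["OSD3_2014-06-20_0m_NPL022"] then gives the run identifiers to search for at ENA.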

How to find the correct data at EBI

The distinction between 16S and 18S is recorded in the Run alias. However, the ENA browser displays the Run title (a short informative description) rather than the Run alias (a unique name provided by the submitter, frequently an ID meaningful only to the submitter). For example, for ERR867761:

<RUN alias="OSD3-lgc-genomics-18S-199"/>
<TITLE>Illumina MiSeq paired end sequencing; Illumina MiSeq sequencing of sample OSD3_2014-06-20_0m_NPL022from OSD-JUN-2014</TITLE>

Furthermore, the Run ERR867760 belongs to the Experiment ERX947555, and the Run ERR867761 belongs to the Experiment ERX947554.

Each Experiment has its own description, where the submitter clearly states which amplicon has been sequenced:

http://www.ebi.ac.uk/ena/data/view/ERX947555 (marine 16S rDNA amplicon sequencing)
http://www.ebi.ac.uk/ena/data/view/ERX947554 (marine 18S rDNA amplicon sequencing)
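
If you already have the Run aliases (for example from the submitted XML), the marker gene can be read from the alias itself. The sketch below assumes that amplicon aliases follow the layout of the single example above, OSD<site>-<center>-<16S|18S>-<number>; this layout is inferred from that one alias and may not hold for every run.

```python
import re

# Layout inferred from the one alias shown in this guide (an assumption).
ALIAS_PATTERN = re.compile(
    r"^(?P<site>OSD\d+)-(?P<center>.+)-(?P<marker>16S|18S)-(?P<serial>\d+)$"
)


def marker_from_alias(alias):
    """Return "16S" or "18S" for an amplicon run alias, else None."""
    match = ALIAS_PATTERN.match(alias)
    return match.group("marker") if match else None


marker = marker_from_alias("OSD3-lgc-genomics-18S-199")
```

Aliases that do not match the pattern, such as those of metagenome runs, return None, so unexpected formats are easy to spot rather than being silently misclassified.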

Additional supplementary/ancillary data

We make all other data (i.e. data not archived in public repositories) available via the MPI Bremen file server. This is the highest-level entry point.

Metagenomic data

Raw metagenomic datasets

All metagenomic raw data sets are archived at the European Nucleotide Archive (ENA).

You can browse and download the archived metagenomes at ENA here:

https://www.ebi.ac.uk/ena/data/view/PRJEB8682
Based on the raw datasets as archived at ENA, the EMG pipeline analyzed all metagenomes:
- You can browse the EMG results here: https://www.ebi.ac.uk/metagenomics/projects/ERP009703/

Workable metagenomic datasets

Browsing EMG data tip

Clicking on a sample name takes you to a page where you can view and download the results of the EBI analysis pipeline (EMG), either via the hyperlinks labelled “Taxonomy” or “Function” or via the download icon in the “Analysis Results” column. You can also download the sequence data itself from these download pages. For example, you can download the data and results for the sample identified as OSD15_2014-06-21_0m_NPL022 (ERS667653) here.
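
Sample names such as the one above appear to encode the site number, sampling date, depth and protocol. The following sketch splits a name into these parts. The layout OSD<site>_<date>_<depth>m_<protocol>, and the reading of the third field as a depth in metres, are inferred from the two sample names shown in this guide and are not an official specification.

```python
import re
from datetime import date

# Layout inferred from the example sample names in this guide (an assumption).
SAMPLE_PATTERN = re.compile(
    r"^OSD(?P<site>\d+)_(?P<date>\d{4}-\d{2}-\d{2})"
    r"_(?P<depth>\d+(?:\.\d+)?)m_(?P<protocol>[A-Za-z0-9]+)$"
)


def parse_sample_name(name):
    """Split a sample name into site, date, depth (assumed metres), protocol."""
    match = SAMPLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected sample name format: {name!r}")
    return {
        "site": int(match.group("site")),
        "date": date.fromisoformat(match.group("date")),
        "depth_m": float(match.group("depth")),
        "protocol": match.group("protocol"),
    }


parsed = parse_sample_name("OSD15_2014-06-21_0m_NPL022")
```

Names that do not fit the assumed layout raise a ValueError instead of being parsed incorrectly.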

Assemblies

See the OSD assemblies page

EMG 16/18S rDNA analysis

We analysed the rDNA sequences identified by the EMG pipeline with SILVAngs. In addition, we identified the rDNAs in the EMG-derived dataset using the SINA aligner and SILVAngs.

Amplicon data (16/18S rDNA) Analysis by SILVAngs

NB: One of the cluster/OTU files in the SILVAngs ‘exports’ folders contains the wrong sequences. Please refer to issue #27 for a detailed explanation.

The main analysis was done using the SILVAngs pipeline on the workable sequence data sets. SILVA taxonomy version 119.1 was used for all 16S datasets and version 119 for all 18S datasets; the differences are very minor and can be viewed here.

The analysis was done for the sequence data obtained from LGC, LifeWatch and Australia.

Note on taxonomy paths in MED exports

The MED exports contain a taxonomy path for each sequence inside the FASTA header. However, this taxonomy is not filtered by the 93% quality value that is the default in SILVAngs. Therefore, to be consistent with the other SILVAngs exports, an extra file with the filtered taxonomy was added to the MED folder. See this issue for more details.

Analysis of workable 16S/18S rDNA from main sequence data set (by LGC)

16S

NB: One of the cluster/OTU files in the SILVAngs ‘exports’ folders contains the wrong sequences. Please refer to issue #27 for a detailed explanation.

18S

NB: One of the cluster/OTU files in the SILVAngs ‘exports’ folders contains the wrong sequences. Please refer to issue #27 for a detailed explanation.

Analysis of workable 16S rDNA dataset from Australia (sequenced by RGC)

NB: One of the cluster/OTU files in the SILVAngs ‘exports’ folders contains the wrong sequences. Please refer to issue #27 for a detailed explanation.

Analysis of workable 18S rDNA datasets (sequenced by LifeWatch Italy)

V4

NB: One of the cluster/OTU files in the SILVAngs ‘exports’ folders contains the wrong sequences. Please refer to issue #27 for a detailed explanation.

V9

NB: One of the cluster/OTU files in the SILVAngs ‘exports’ folders contains the wrong sequences. Please refer to issue #27 for a detailed explanation.
