Fast UniFrac and PCoA analysis software user guide

Fast UniFrac is a new version of UniFrac that is specifically designed to handle very large datasets. Like UniFrac, Fast UniFrac provides a suite of tools for the comparison of microbial communities using phylogenetic information. It takes as input a single phylogenetic tree that contains sequences derived from at least three different environmental samples, a file mapping ids used in the tree to a set of unique sample ids (same format as the prior version's 'environment file'), and an (optional) category mapping file describing additional relationships between samples and subcategories for visualizations. For example, in a given set of gut samples, you might define subcategories for different diets, different physical locations/dates, different species, and/or different treatments like antibiotics or high fat. For sample data click here. For citation, click here.

Both the UniFrac distance metric and the P test can be used to make comparisons. Both of these techniques bypass the need to choose operational taxonomic units (OTUs) based on sequence divergence prior to analysis.

Fast UniFrac allows you to:

Determine if the samples in the input phylogenetic tree have significantly different microbial communities.

Cluster samples to determine whether there are environmental factors (such as temperature, pH, or salinity) that group communities together.

Determine whether the system under study was sampled sufficiently to support cluster nodes.

Easily visualize the differences between samples graphically, with support for three-dimensional exploration of datasets and with multiple subcategory coloring.

Please enter your email and password to continue. After you register you will be able to analyze up to 100000 unique sequences, up to 200 samples, and perform significance tests based on up to 1000 tree permutations.

If you wish to analyze much larger datasets than the defaults, please contact us and we will be happy to try to accommodate you.

Fast UniFrac tutorial

Introduction

This tutorial takes you through the steps of analyzing data in the Fast UniFrac web application. The purpose of this tutorial is to show you how to use the interface to find the important variables for describing phylogenetic variation among your samples: in this case, to test what types of physical or chemical factors are most important for structuring bacterial diversity. The dataset used in this tutorial includes 50 of the 464 samples analyzed in Ley, RE, Lozupone, CA, Hamady, M, Knight, R and JI Gordon. (2008). Worlds within worlds: evolution of the vertebrate gut microbiota. Nat. Rev. Microbiol. 6(10): 776-88 (Pubmed). It includes sequences from 16S ribosomal RNA surveys of diverse free-living bacterial assemblages and the guts of diverse mammals and termites. At the end of this tutorial, you should be fully equipped to test hypotheses about your own sequences.

Also included in this tutorial are other example files you may use to explore some of the other features of Fast UniFrac.

Example data files

To use Fast UniFrac, you need three files: a tree file, a sample id mapping file, and a category mapping file. The tree file contains a phylogenetic tree, in Newick format. The sample id mapping file contains a table showing how many times each taxon (from the tree) occurred in each of your samples. The category mapping file contains additional metadata about the samples, and is a table relating each sample to parameters you have measured such as temperature, pH, etc. In general, people usually prepare the two mapping files using Excel, although it is important to save them in plain text format and not as Excel documents.

You can either generate your own tree file, or use one of the reference trees. The PhyloChip reference tree matches the probes on the PhyloChip and is useful for analyzing PhyloChip data; the Greengenes reference tree is from the Greengenes core set and is a phylogenetically diverse and representative set of bacteria. These trees are built using 16S rRNA, although you can use trees built from any molecule, not just the 16S, or even trees constructed from morphological or other data.

The sample id mapping file must be generated by mapping the sequence ids in the tree file to the sample ids used in your study. In other words, exactly the same taxon names must be used in your tree and in your sample id mapping file.

The category mapping file maps your sample ids to additional metadata, such as subcategories and sample descriptions. This file can be autogenerated but it is highly recommended that you generate one that is meaningful for the variation you plan to examine in your studies. For example, if you were studying the effects of diet on the gut communities of conventional and humanized mice, you might want one column indicating whether the sample was from a conventional or a humanized mouse, another column indicating whether the mouse was on a chow diet or a high-fat diet, another column containing the combination of these two columns (i.e. diet and humanized/conventional), etc.

In this section, several example files are listed, not all of which are used in this tutorial.

Greengenes coreset reference datasets

This is the tree and the sequences matching the Greengenes core set as of May 2009. These files are useful for mapping your sequences against known bacterial diversity.

1. Greengenes coreset tree (May 09)

2. Greengenes coreset fasta (May 09)

NRM data (demo subset)

These data are from the Ley et al. 2008 Nature Reviews Microbiology paper referenced above, and provide an example of mapping heterogeneous reads to the Greengenes core set tree so that the communities can be compared by UniFrac. The sample ID mapping file was generated by blasting the dataset from the paper against the Greengenes_coreset_fasta file linked above, and the category mapping file was constructed manually to provide a range of fine- and coarse-grained representations of the environmental data.

1. Ley et al example sample ID mapping file

2. Ley et al example category mapping file

Example PhyloChip data

Example data from the Sagaram et al. 2009 AEM paper (Pubmed) for use with the PhyloChip reference tree.

1. Sagaram et al PhyloChip sample ID mapping file

2. Sagaram et al PhyloChip category mapping file

Crump et al data

These sequences are from Crump et al. 1999 "Phylogenetic analysis of particle-attached and free-living bacterial communities in the Columbia river, its estuary, and the adjacent coastal ocean", AEM 65:3192 (Pubmed). This dataset was used in the original online UniFrac tutorial (Pubmed), so it is provided again here with two important changes. We provide an example category mapping file that contains additional metadata about each of the samples.

1. Crump et al example tree file

2. Crump et al example sample ID mapping file

3. Crump et al example category mapping file

Megablast protocol and sample mapping generation script

The application of UniFrac to large sequence sets, such as those generated with pyrosequencing, is also limited by the computational power needed to make a de novo phylogenetic tree using standard methods, such as neighbor joining, likelihood, or parsimony methods. In order to prepare phylogenetic trees for input into UniFrac from very large datasets, we recommend using QIIME. The best sources of information about QIIME are the website and the QIIME paper, which you can get at the following links:

1. Source code

2. QIIME allows analysis of high-throughput community sequencing data

The quickest way to get started with QIIME is using the virtual machine.

One potential workflow for working with large datasets is to use QIIME to:

1. Preprocess sequences to handle low quality reads

2. Select OTUs

3. Generate a phylogenetic tree

and then use the QIIME script convert_otu_table_to_unifrac_sample_mapping.py to generate the proper input files for the Fast UniFrac web interface.

In the initial release of Fast UniFrac, we also described the following procedure for generating a phylogenetic tree, which is based on mapping sequences to their closest relative in a reference tree using BLAST. This functionality is now in QIIME, and we recommend using QIIME for this step, but retain this documentation below for those who may still be interested in using it.

The BLAST to greengenes protocol

We illustrate that the analysis of such large sequence sets can be carried out by assigning them to their closest relative in a phylogeny of the Greengenes core set (DeSantis et al., 2006) using BLAST's megablast protocol (Altschul et al., 1990). Below is a detailed protocol for carrying out this analysis. Note that a different BLAST database can be substituted for use with any reference tree.

1. Create the Greengenes BLAST database:

This link is a fasta file containing the sequences from the greengenes coreset. This fasta record can be formatted into a BLAST database using the command:

formatdb -i GreenGenesCore-May09.ref.fna -p F -o F -n gg_coreset

2. Perform the megablast search:

A fasta record of your samples can be BLASTed against the gg_coreset BLAST database created in step 1 using the following command:

blastall -p blastn -n T -d gg_coreset -i -e 1e-30 -b 5 -m 9 -o blast_output.txt

Note that the -m 9 flag is essential because it specifies the hit table output format that the script below requires.

Also note that the sequence names must conform to the following format:

sampleNameDelimitersequenceId

For instance, if you sequenced 2 clones from each of two samples named SA and SB, valid sequence names might be:

SA#01
SA#02
SB#01
SB#02

If you have not named the sequences according to this convention, it is also possible to use a mapping file describing which sequence is from which sample. See the documentation within the code for more details on this.
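The naming convention above can be checked before running BLAST. The following is a minimal sketch, not part of the original protocol: the function name split_sequence_name is hypothetical, and it assumes the name is split at the first occurrence of the delimiter, which the tutorial does not specify.

```python
def split_sequence_name(name, delimiter="#"):
    """Split a sequence name such as 'SA#01' into (sample name, sequence id).

    Assumption: the split happens at the FIRST delimiter in the name.
    """
    sample, sep, seq_id = name.partition(delimiter)
    if not sep or not sample or not seq_id:
        raise ValueError(
            f"{name!r} does not follow the sampleName{delimiter}sequenceId convention"
        )
    return sample, seq_id
```

Calling split_sequence_name on each fasta header and catching ValueError gives a quick list of names that would need the alternative mapping file.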

3. Use this python script and the BLAST output from step 2 to create an environment file that can be used with UniFrac:

Note that the PyCogent toolkit must be downloaded from SourceForge and the cogent directory should be on your PYTHONPATH.

You can then use the code as follows:

python create_unifrac_env_file_BLAST.py <blast_output.txt> <outfile_path.txt> <sample_name_delimiter>

blast_output.txt: Path to the hit tables from the BLAST searches

outfile_path.txt: Path to where the environment file will be saved

sample_name_delimiter: A delimiter (e.g. a #) that separates the sample name from the sequence id.
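To make the conversion concrete, here is a rough sketch of what turning a hit table into an environment file involves. It is not the create_unifrac_env_file_BLAST.py script and does not use PyCogent. It assumes the tab-delimited layout written by -m 9 (comment lines start with "#", query id in the first field, subject id in the second), that the first hit listed for a query is its best hit, and that the output uses the three-column layout of sequence ID, sample ID, and count. The function names are hypothetical.

```python
from collections import Counter


def best_hits(lines):
    """Yield (query_id, subject_id) for the first hit listed for each query."""
    seen = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the comment lines of the hit table
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        if query not in seen:
            seen.add(query)
            yield query, subject


def environment_rows(lines, delimiter="#"):
    """Count how often each reference sequence was the best hit in each sample."""
    counts = Counter()
    for query, subject in best_hits(lines):
        # Assumption: the sample name is everything before the first delimiter.
        sample = query.split(delimiter, 1)[0]
        counts[(subject, sample)] += 1
    return [
        f"{subject}\t{sample}\t{n}"
        for (subject, sample), n in sorted(counts.items())
    ]
```

Passing an open hit table file to environment_rows and writing the returned rows, one per line, produces a file in which each reference sequence is listed once per sample together with the number of reads assigned to it.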


Steps

1. Create a phylogenetic tree containing sequences from samples that you would like to compare, or select a reference tree.

The tree should be rooted, and must have branch lengths to use Fast UniFrac. Typically, the tree is rooted by including an outgroup, e.g. an archaeal sequence to root the bacteria, but we sometimes use midpoint rooting as well. If an unrooted tree is supplied, UniFrac will assign a root arbitrarily. If you have extra sequences in the tree that are not annotated by sample, they will automatically be removed from the tree when you upload the file, so the outgroup will not be included in the analysis. If no sequences appear in the tree after upload, the most likely problem is that there was an issue with your sample ID mapping file (for example, you might have used GenBank identifiers in the tree, but NCBI GIs in the sample ID mapping file, which wouldn't match each other).
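A mismatch between tree and mapping identifiers can be caught before upload. The sketch below is an illustration only: the function names are hypothetical, and the regular expression assumes plain, unquoted tip labels without spaces, commas, colons, or parentheses, so a full Newick parser would be needed for trees with quoted labels.

```python
import re


def newick_tip_names(newick):
    """Return the tip labels of a Newick string.

    A tip label is taken to be a name that directly follows '(' or ','.
    Internal node labels, which follow ')', are ignored.
    """
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick))


def compare_ids(newick, mapping_lines):
    """Return (IDs in the mapping but not in the tree, tips without a sample)."""
    tips = newick_tip_names(newick)
    mapped = {
        line.split("\t")[0].strip() for line in mapping_lines if line.strip()
    }
    return sorted(mapped - tips), sorted(tips - mapped)
```

If the first list is long and the second contains nearly every tip, the two files most likely use different identifier schemes. Tips in the second list alone, such as an outgroup, are expected and are simply dropped on upload.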

There are many different programs that you can use for sequence alignment and/or phylogeny inference, including the NAST alignment tool, PyNAST, FastTree, ARB, ClustalW, MUSCLE, PHYLIP, PAUP, or MrBayes. For 16S rRNA sequences, we prefer PyNAST for alignment. For generating trees from large datasets, we prefer either FastTree for de novo tree generation or mapping sequences to their closest relative in a reference tree. These preferred options as well as several others can be run using QIIME. For large datasets, it is greatly preferred to select OTUs prior to the alignment and tree building step. This cuts down on the computation time and does not have an effect on the results. Because UniFrac depends on branch lengths, it is important to look at your tree to ensure that you don't see long branches that result from misalignment rather than from long periods of evolution. At the end of this process, you can export the tree in Newick format for upload into the UniFrac interface.

Alternatively, you can choose one of the reference trees provided and map your sequences to this tree. This can be useful, particularly for large datasets, such as those produced by 454 pyrosequencing, since creating a single phylogenetic tree with all sequences may not be feasible with the programs listed above. One simple way to map your sequences onto their closest relatives in a reference tree is to use megablast. In this tutorial, the original sequences from the NRM paper were assigned to their closest hit in the 11-Aug_2007 version of the greengenes coreset (can be downloaded from v/Download/Sequence_Data/Fasta_data_files/). Sequences with no hit or that match with an e-value greater than e-50 were dropped from this example dataset.

For the purpose of this tutorial, we provide the greengenes coreset tree in Newick format that we exported from an arb database that is available for download at v/Download/Sequence_Data/Arb_databases/. A small number of sequences were added to this tree using parsimony insertion in arb so that the fasta data files and tree for the core set were in sync. The resulting tree (Greengenes coreset tree (May 09)) and corresponding sequences (Greengenes coreset fasta (May 09)) can be downloaded, but please note that this tree can be imported to your history and does not need to be re-uploaded. In order to import the GreenGenes reference tree to your history, follow these steps:

1. In the upper menu, go to Shared Data - Data Libraries:

2. Then, select 'GreenGenes coreset tree (May 09)':

3. Click on the checkbox next to '' and, finally, on the 'Go' button:

4. The reference tree is now in your history and you can use it.

2. Create a sample ID mapping file.

This file maps each sequence ID in the tree to the sample ID that it came from. This must be done manually (or via a script): for each sequence, type the sequence ID used in the tree, then a tab, then the sample ID that it comes from, then optionally, another tab and then the number of times each sequence was observed (sequence abundance).

The sequence abundance column is important if you have dereplicated the sequence data in any way (e.g. choosing OTUs and only including a representative sequence in the tree, removing exact duplicate sequences, or pre-screening clones using RFLP patterns prior to sequencing), and you are planning on using tools in the interface that consider differences in relative abundance (e.g. weighted UniFrac). It is fine to use a tree and sample ID mapping file with all of the sequences (e.g. 5 duplicate sequences in the tree each with a weight of 1 rather than 1 representative sequence with a weight of 5) and to perform abundance-based analyses, although dereplicating the data will allow you to process larger datasets.
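As a quick sanity check on the abundance column, the sketch below totals the observations recorded for each sample. It is an illustration rather than part of Fast UniFrac: the function name is hypothetical, and it assumes that a row without a third column counts as a single observation.

```python
from collections import defaultdict


def sequences_per_sample(mapping_lines):
    """Sum the abundance column of a sample ID mapping file for each sample."""
    totals = defaultdict(int)
    for line in mapping_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        # Assumption: a missing abundance column means the sequence was seen once.
        count = int(fields[2]) if len(fields) > 2 and fields[2] else 1
        totals[fields[1]] += count
    return dict(totals)
```

Under that assumption, five rows with a weight of 1 and one representative row with a weight of 5 give the same total for a sample, which is a simple way to confirm that dereplication did not lose any reads.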

For PCoA analysis, it is most convenient to name each environment so that samples of the same type have names that start with the same first 1, 3, or 5 letters or that have sample types followed by a period, hash, or plus character (this allows you to apply colors in the PCoA scatterplots later).
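The separator-based convention can be previewed locally to see which samples would share a color. This is a small sketch with hypothetical function names; it only mirrors the rule described above (sample type followed by a period, hash, or plus character) and is not the code the web interface uses.

```python
import re


def sample_type(sample_id):
    """Return the part of a sample ID before the first '.', '#', or '+'."""
    return re.split(r"[.#+]", sample_id, maxsplit=1)[0]


def group_by_type(sample_ids):
    """Group sample IDs by their sample-type prefix, keeping input order."""
    groups = {}
    for sid in sample_ids:
        groups.setdefault(sample_type(sid), []).append(sid)
    return groups
```

For example, the sample IDs Tg#1249 and Tg#1251 both fall under the type Tg, while Nso#65 falls under Nso.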

In this example, there are 50 bacterial samples from the following sample types: Surface and subsurface saline water (Sws and Swb, respectively), Nonsaline water (Nw), Saline sediments (Sse), Nonsaline sediments (Nsa), Soils (Nso), the Vertebrate gut (Vg) and the Termite gut (Tg). We'll label each sample with its 2-3 letter sample code, followed by a hash, and a unique number because our hypothesis is that the organisms from the same overall environment should be more similar to one another.

The following is a short snippet of a sample ID mapping file. The first column is the sequence ID, the second column is the sample ID, and the last column is the number of times the sequence was observed.

150394	Tg#1249	1
150394	Tg#1251	2
215260	Nso#65	1
215260	Nso#129	4
16073	Vg#h#11	1
...

For the purpose of this tutorial, we provide a sample ID mapping file called fastunifrac_Ley_et_al_NRM_2_sample_id_ip sample ID mapping file.

3. Create a category mapping file.

The category mapping file relates sample names in the sample ID mapping file to their related metadata (defined via subcategory columns) and descriptions of where the samples came from. The descriptions can be accessed throughout the results interface in order to make them easier to interpret. The subcategory columns allow for dynamic coloring of PCoA results in the 3d viewer to determine which categories are related to which principal coordinate axes.

For the purpose of this tutorial, we provide a category mapping file called Ley et al example category mapping file with 4 subcategory columns that define for each sample (1) which sample type it is from (EnvType), (2) whether the sample came from a free-living bacterial assemblage or from the gut (FreelivingGut), (3) whether the free-living communities were saline or nonsaline (SalineNon), and (4) whether they were from aquatic (Water) or "Particulate" samples such as soils and sediments (WaterPartic). There is also a short description of each sample in the final column.

The file format is tab-delimited text. The first line is a header line that must start with a "#" character.

Optionally, a general description of the input files can be included in the lines immediately following the header line that start with a "#". This description will be included in the upload and results screens so that relevant information can be easily accessed.

The first column must be named SampleID and must contain unique (short, meaningful) sample IDs containing only alphanumeric characters (with the exception of ".", "+", and "#" characters).

The second column to "n-1 th" column are subcategories. These can be anything (random assignment if you want) but each subcategory should have a small number of distinct values (<= number of samples). There must be at least two unique values for each category.

The last column must be named "Description" and contains the short descriptions for the samples.

#SampleID	EnvType	FreelivingG	Description
#General description of analysis line 1 (optional)
#General description of analysis line 2 (optional)
#...
Tg#1249	TermiteGut	G	Whole gut of the wood-feeding termite
Tg#1251	TermiteGut	G	Whole gut of the fungus-growing termite Macrotermes gilvus
Nso#65	Soil	Freelivi	Uncultivated agricultural soil in Wisconsin
Nso#1209	Soil	Freelivi	Soil from a fertilized Switzerland plot in the DOK.
Vg#h#111	VertebrateGut	G	Feces from Angolan Colobus Monkey from the St Louis Zoo.

For the purpose of this tutorial, we provide a category mapping file called fastunifrac_Ley_et_al_NRM_3_category_ip sample ID mapping file.
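The format rules for the category mapping file can be checked before upload. The sketch below encodes only the rules stated in this step. The function name and the wording of the messages are hypothetical, and the Fast UniFrac server may apply additional checks that are not covered here.

```python
import re

# Sample IDs may contain alphanumeric characters plus '.', '+', and '#'.
VALID_ID = re.compile(r"[A-Za-z0-9.+#]+")


def check_category_mapping(lines):
    """Return a list of problems; an empty list means no stated rule was violated."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if not lines or not lines[0].startswith("#"):
        return ["first line must be a header starting with '#'"]
    problems = []
    header = lines[0][1:].split("\t")
    if header[0] != "SampleID":
        problems.append("first column must be named SampleID")
    if header[-1] != "Description":
        problems.append("last column must be named Description")
    # Skip the optional description lines that directly follow the header.
    body = lines[1:]
    while body and body[0].startswith("#"):
        body = body[1:]
    rows = [ln.split("\t") for ln in body]
    for row in rows:
        if len(row) != len(header):
            problems.append(f"row {row[0]!r} has {len(row)} columns, expected {len(header)}")
    ids = [row[0] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("sample IDs must be unique")
    for sid in ids:
        if not VALID_ID.fullmatch(sid):
            problems.append(f"invalid characters in sample ID {sid!r}")
    for i, name in enumerate(header[1:-1], start=1):
        values = {row[i] for row in rows if len(row) == len(header)}
        if len(values) < 2:
            problems.append(f"subcategory {name!r} needs at least two distinct values")
    return problems
```

Running the check on the lines of a tab-delimited text export and printing the returned messages is usually enough to catch a header that lost its "#" or a subcategory column with a single value.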

4. Go to the Fast UniFrac web site.

If you're reading this tutorial, you already know how to get here. You will need to register and log in to complete the tutorial, because we restrict the number of sequences that unregistered users can analyze. The reason for this is that many of the analyses are computationally expensive, so we need to keep track of which groups are using a lot of resources to ensure fair access for everyone. Please note that if you have previously registered for the original UniFrac interface, you will have to contact microbiomehelp@colorado.edu to register for FastUniFrac. We apologize for this inconvenience.

5. The Fast UniFrac upload screen

After you have logged in, you have to upload your sample ID mapping file and your category mapping file. To get to the upload page, click 'Get data' on the Tools panel and then 'Upload file':

Then, the upload page will appear:

First, upload your sample ID mapping file. Click 'Browse' below where it says File, and navigate to your sample ID mapping file (in this case, fastunifrac_Ley_et_al_NRM_2_sample_). One common problem is that you might have your sample ID mapping file saved as a Word document: this will NOT work, because Word uses a proprietary file format that is difficult for other programs to read. If you are saving your sample ID mapping file from Word, remember to save it as Plain Text, NOT as Microsoft Word. If you are using Excel, save as Tab-delimited Text. At the end of this process, your screen should look like this:

state - blue color) will appear in the history panel:

While the sample ID mapping file is uploading, you can start the category mapping file upload. To upload your category mapping file, follow the above steps, but now navigate to your category mapping file (in this case, fastunifrac_Ley_et_al_NRM_3_). This file is most easily created in Excel; remember to save it as Tab-delimited text.

If you have your own tree file, you can upload it following these same steps. In this tutorial, we will use the 'GreenGenes Core - May 2009' tree, which is already on the system.

Once all the files are uploaded (the datasets in the history panel are shown in green) you can start any of the available analyses in Fast UniFrac.

6. Measuring the overall difference between each pair of samples.

In order to generate the raw distances between each pair of samples using the UniFrac metric, first choose the Sample Distance Matrix option from the Tool panel, under the 'Fast UniFrac' section.

On the Sample Distance Matrix page you can select the reference tree, sample ID mapping file and the category mapping file you want to use to perform the analysis. First, select the 'GreenGenes Core - May 2009' tree using the drop-down menu below 'Select reference tree'. Next, select the '1: fastunifrac_Ley_et_al_NRM_2_sample_id_' file and the '2: fastunifrac_Ley_et_al_NRM_3_category_' file using the drop-down menus below 'Select sample ID mapping file' and 'Select category mapping file', respectively. If you then click the 'Execute' button, you will get a message saying that your job has been submitted to the queue, and two new datasets will appear in the History panel. When the datasets are green (the time depends on server load) you can view them by clicking on the eye icon. The first dataset will display a screen like the following, containing the distance matrix that relates each pair of environments:
