seqprofile

Calculate sequence profile from set of multiply aligned sequences

Syntax

Profile = seqprofile(Seqs)
[Profile, Symbols] = seqprofile(Seqs)
seqprofile(Seqs, ...'Alphabet', AlphabetValue, ...)
seqprofile(Seqs, ...'Counts', CountsValue, ...)
seqprofile(Seqs, ...'Gaps', GapsValue, ...)
seqprofile(Seqs, ...'Ambiguous', AmbiguousValue, ...)
seqprofile(Seqs, ...'Limits', LimitsValue, ...)

Arguments

Seqs

Set of multiply aligned sequences represented by any of the following:.

  • Character array

  • Cell array of character vectors

  • String vector

  • Array of structures containing the field Sequence

AlphabetValue

Character vector or string specifying the sequence alphabet. Choices are:

  • 'NT' — Nucleotides

  • 'AA' — Amino acids (default)

  • 'none' — No alphabet

When Alphabet is 'none', the symbol list is based on the observed symbols. Each character can be any symbol, except for a hyphen (-) and a period (.), which are reserved for gaps.

CountsValue

Controls returning frequency (ratio of counts/total counts) or counts. Choices are true (counts) or false (frequency). Default is false.

GapsValue

Character vector or string that controls the counting of gaps in a sequence. Choices are:

  • 'all' — Counts all gaps

  • 'noflanks' — Counts all gaps except those at the flanks of every sequence

  • 'none' — Default. Counts no gaps.

AmbiguousValue

Controls counting ambiguous symbols. Enter 'Count' to add partial counts to the standard symbols.

LimitsValue

Specifies whether to use part of the sequence. Enter a [1x2] vector with the first position and the last position to include in the profile. Default is [1,SeqLength].

Description

Profile = seqprofile(Seqs) returns Profile, a matrix of size [20 (or 4) x SequenceLength] with the frequency of amino acids (or nucleotides) for every column in the multiple alignment. The order of the rows is given by

  • 4 nucleotides — A C G T/U

  • 20 amino acids — A R N D C Q E G H I L K M F P S T W Y V

[Profile, Symbols] = seqprofile(Seqs) returns Symbols, a unique symbol list where every symbol in the list corresponds to a row in Profile, the profile.

seqprofile(Seqs, ...'PropertyName', PropertyValue, ...) calls seqprofile with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive. These property name/property value pairs are as follows:

seqprofile(Seqs, ...'Alphabet', AlphabetValue, ...) selects a nucleotide alphabet, amino acid alphabet, or no alphabet.

seqprofile(Seqs, ...'Counts', CountsValue, ...) when Counts is true, returns the counts instead of the frequency.

seqprofile(Seqs, ...'Gaps', GapsValue, ...) appends a row to the bottom of a profile (Profile) with the count for gaps.

seqprofile(Seqs, ...'Ambiguous', AmbiguousValue, ...) when Ambiguous is 'count', counts the ambiguous amino acid symbols (B Z X) and nucleotide symbols (R Y K M S W B D H V N) with the standard symbols. For example, the amino acid X adds a 1/20 count to every row while the amino acid B counts as 1/2 at the D and N rows.

seqprofile(Seqs, ...'Limits', LimitsValue, ...) specifies the start and end positions for the profile relative to the indices of the multiple alignment.

Examples

collapse all

Create an array of structures representing a multiple alignment of amino acids:

seqs = fastaread('pf00002.fa');

Return the sequence profile and symbol list from position 50 through 55 of the set of multiply aligned sequences, counting all gaps.

[Profile2,Symbols2] = seqprofile(seqs,'limits',[50 55],'gaps','all')
Profile2 = 21×6

    0.0312    0.0312    0.1562    0.4375    0.1250    0.2188
         0         0    0.3750         0         0         0
         0         0    0.0938    0.1562         0         0
         0         0         0    0.0312         0         0
         0    0.0625         0         0    0.0312         0
         0         0         0    0.0312         0         0
         0         0         0    0.1250         0         0
    0.0312         0    0.0625         0         0         0
         0         0         0         0         0         0
    0.4688    0.0625         0         0    0.3125    0.1562
      ⋮

Symbols2 = 
'ARNDCQEGHILKMFPSTWYV-'
Introduced before R2006a