Welcome to BaMMServer¶
What is the BaMM webserver?¶
Bayesian Markov Models (BaMMs) have been shown to outperform simple position weight matrices (PWMs) in learning regulatory motifs from next-generation sequencing data [SSoding16].
BaMM webserver is a web resource for discovery of regulatory motifs with higher-order Bayesian Markov Models (BaMMs).
It supports four workflows for integrated motif analyses:
- Discovering higher-order motifs in a set of nucleotide sequences
- Searching sequences for motif occurrences
- Browsing and text searching higher-order model databases
- Comparing motifs to our higher-order databases
BaMM webserver is designed to make the power of higher-order motif analysis accessible in common motif analysis tasks.
Optimal input data¶
- The BaMM webserver works best with:
- 1000-10000 short (up to 250nt) nucleotide sequences enriched with motifs in fasta format.
- Sequences derived from ChIP-seq, CLIP-seq, HT-SELEX, or similar techniques.
- Sequences that passed quality control and are preselected for bound sequences (see also How do I prepare my ChIP-seq data?)
- If you intend to …
- … submit long sequences, please have a look at: Can I use the server with long sequences (>250bp)?
- … submit very few sequences, please have a look at: Can I use the server with very few sequences?
Understanding BaMMs¶
Higher-order Markov models are by no means an invention of the Soedinglab. The theory has been in the literature for almost as long as PWMs. However, higher-order models have a huge disadvantage: the number of parameters increases exponentially with the order of the model. This makes it difficult to train these models robustly with limited data.
BaMMs introduce a novel choice of the regularization for preventing overfitting. They are constructed in a way that lower orders act as priors on higher orders. This way higher-order parameters are pruned if not supported by enough evidence.
The details are described in the following scheme.

Bayesian Markov model training automatically adapts the effective number of parameters to the amount of data.
In the last line, if the context GCT is so frequent at position j in the motif that its number of occurrences outweighs the pseudo-count strength, \(n_j(GCT) \gg \alpha_3\), the third-order probabilities for this context will be roughly the maximum likelihood estimate, e.g. \(p_j(A|GCT) \approx n_j(GCTA)/n_{j-1}(GCT)\).
However, if few GCT were observed in comparison with the pseudo-counts, \(n_j(GCT) \ll \alpha_3\), the third-order probabilities will fall back on the second-order estimate, \(p_j(A|GCT) \approx p_j(A|CT)\). If also \(n_j(CT) \ll \alpha_2\), then likewise the second-order estimate will fall back on the first-order estimate, and hence \(p_j(A|GCT) \approx p_j(A|T)\).
In this way, higher-order dependencies are only learned for the fraction of k-mer contexts that occur sufficiently often at one position j in the motif’s training instances to trump the pseudo-counts. Throughout this work we set \(\alpha_0 = 1\) and \(\alpha_k = 20 \times 3^{k-1}\).
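The fallback behaviour described above can be illustrated with a small sketch. It assumes the common pseudo-count interpolation in which the order-k estimate adds \(\alpha_k\) pseudo-counts that are distributed according to the order-(k-1) estimate. The function name, the count dictionaries, the uniform prior below order 0 and all numbers are hypothetical and are not taken from the BaMMmotif code.

```python
def interpolated_prob(base, context, kmer_counts, context_counts, alphas, background):
    """Estimate p_j(base | context) at a single motif position j.

    kmer_counts[context + base]: occurrences of the k-mer ending at position j
    context_counts[context]:     occurrences of the context ending at position j-1
                                 (for the empty context: number of motif instances)
    alphas[k]:                   pseudo-count strength for order k
    background[base]:            prior used below order 0 (an assumption)
    """
    k = len(context)
    if k == 0:
        prior = background[base]
    else:
        # The next-lower order drops the base that is furthest from position j.
        prior = interpolated_prob(base, context[1:], kmer_counts,
                                  context_counts, alphas, background)
    n_kmer = kmer_counts.get(context + base, 0)
    n_context = context_counts.get(context, 0)
    return (n_kmer + alphas[k] * prior) / (n_context + alphas[k])


# Made-up counts for one motif position, for illustration only.
kmer_counts = {"A": 40, "C": 20, "G": 20, "T": 20,
               "TA": 10, "TC": 5, "TG": 3, "TT": 2}
context_counts = {"": 100, "T": 20}
alphas = [1.0, 20.0, 60.0]  # illustrative pseudo-count strengths
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

p_first = interpolated_prob("A", "T", kmer_counts, context_counts, alphas, background)
# The context CT was never observed, so the second-order estimate
# falls back on the first-order estimate p_j(A|T).
p_second = interpolated_prob("A", "CT", kmer_counts, context_counts, alphas, background)
```

With these counts the second-order value equals the first-order value, whereas a context observed far more often than its pseudo-count strength would end up close to the maximum likelihood ratio.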
Available workflows¶
The BaMM webserver offers four workflows for analyzing regulatory motifs. The workflows are described in detail below.

BaMM webserver workflows
De-novo motif discovery¶

For our higher-order BaMM models, de-novo motif discovery takes place in two stages: seeding and motif refinement. In the seeding stage, we use PEnG-motif, a very fast algorithm for finding enriched IUPAC base patterns which we optimize to seed PWMs. In the refinement stage, the PWM seeds are then optimized to higher-order BaMMs.
The default de-novo workflow hides the seeding phase from the user. The best-performing seeds are automatically selected for higher-order refinement. We also offer a second, slightly more complex workflow that reports possible seed PWMs and lets the user choose seeds for refinement. You can access it by clicking the following button at the bottom of the de-novo motif discovery workflow:

Usage¶
In its simplest form, the de-novo motif discovery workflow requires a fasta file with sequences and reports up to four higher-order models.
By clicking on Advanced Options, the user can choose from a wide variety of additional settings and parameters organized into four subgroups: general settings, seeding stage, refinement stage and settings for further analyses.
General settings¶

- Search on both strands
- if unchecked, motifs can not lie on the reverse complemented strand. Searching on the PLUS strand only can be useful for stranded data, such as RNA.
- Background Sequences
- by default the background model is learnt as a homogeneous Markov model on the input sequences. If you have a separate negative set, you can upload it as a fasta file here.
- Background Model Order
- sets the order of the background model. The higher the background order, the more realistic the background model. We recommend order 2 for ChIP-seq data. For very short motifs (e.g. RNA-binding motifs), an order of 1 or 0 may be necessary to detect the motif.
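To make the notion of a homogeneous background model of a given order concrete, here is a minimal sketch that estimates conditional probabilities by counting (order+1)-mers across all sequences, independent of position. The function name and the additive pseudocount are assumptions chosen for illustration. The server's own training procedure may differ.

```python
from collections import Counter
from itertools import product


def train_background(sequences, order, pseudocount=1.0):
    """Return {context: {base: P(base | context)}} for a homogeneous Markov model."""
    alphabet = "ACGT"
    kmer_counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            kmer = seq[i - order:i + 1]
            # Skip windows that contain N or any other non-ACGT letter.
            if all(letter in alphabet for letter in kmer):
                kmer_counts[kmer] += 1
    model = {}
    for context in map("".join, product(alphabet, repeat=order)):
        total = sum(kmer_counts[context + b] for b in alphabet) + 4 * pseudocount
        model[context] = {b: (kmer_counts[context + b] + pseudocount) / total
                          for b in alphabet}
    return model


# Toy input: a first-order model has 4 contexts with 4 probabilities each.
bg = train_background(["ACGTACGTAC", "TTGACGNACG"], order=1)
```

The number of contexts grows as 4^order, which is why a high background order needs more input sequence to be estimated reliably.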
Seeding stage¶

- Pattern Length
- The length W of patterns on the sequences to be searched.
- Z-Score Threshold
- Only W-mers which surpass this z-score threshold will be considered for seed optimization.
- Count Threshold
- Only W-mers that surpass this count threshold will be considered for seed optimization.
- IUPAC Optimization Score
Scoring function that is optimized in IUPAC pattern generation. Currently there are three options:
- LOGPVAL: optimize to IUPAC pattern with the lowest p-value
- MUTUAL_INFO: optimize to IUPAC pattern that has the highest mutual information between presence of a motif and being a positive sequence
- ENRICHMENT: optimize to IUPAC pattern with the highest enrichment over negative sequences
- Skip EM
- When checked, the seeds are not optimized with the Expectation-Maximization (EM) algorithm.
- Number of optimized seeds
- Up to this number of seeds is refined to higher-order models.
Refinement stage¶

- Model Order
- order of the Markov model. Models with high orders are more time consuming to train.
- Flank extension
- extend the core seed by extra positions to the left and the right. Can be used to learn weakly informative flanking regions.
Settings for further analyses¶

- Run motif scanning
- uncheck to skip scanning the input sequences for motif occurrences.
- Motif scanning p-value cutoff
- p-value cut-off for calling a position a binding site.
- Run motif evaluation
- uncheck to skip motif performance evaluation.
- Run motif-motif compare
- uncheck to skip motif annotation with models from one of our databases
- MMcompare e-value cutoff
- e-value cutoff for reporting motif-motif matches with our motif database
Motif scan¶

Motif scan takes a motif and a set of sequences and predicts binding positions of the motif. The uploaded motif can be either in MEME-format (>= version 4) or in BaMM format.
When scanning with a BaMM motif, two files are required: a BaMM model (extension *.ihbcp) and its corresponding background frequencies (extension *.hbcp).
By default the performance of the motifs on the input set is evaluated. Optionally the motifs can also be annotated with one of our motif databases.
Please refer to Usage for a detailed description of the advanced parameter settings.
Motif database¶

Our motif database consists of over 1000 fourth-order BaMMs, trained on ChIP-seq peaks collected by the GTRD project [YSV+16].
The BaMMs fall into the following sub-collections:
- 613 motif models for Homo sapiens (human)
- 354 motif models for Mus musculus (mouse)
- 19 motif models for Rattus norvegicus (rat)
- 16 motif models for Danio rerio (zebrafish)
- 34 motif models for Schizosaccharomyces pombe (yeast)
- 360 motif models for Drosophila melanogaster (fly) based on ModERN [KVG+17]
Warning
Please be aware that the BaMM databases are automatically generated. While comparison against manually curated databases showed that they are generally of high quality, sometimes we also learn co-factors or a combination of protein-of-interest and cofactors.
For users relying on accurate motif annotation we also offer the manually curated PWM databases JASPAR Core [KFS+17] and HOCOMOCO [KVY+18].
Motif-motif comparison¶

The motif-motif comparison tool lets you search with a motif in MEME or BaMM format against a motif subcollection of our database. The e-value is the only configurable parameter.
Understanding the output¶
The BaMM webserver offers a wide variety of analyses and plots. In this section we describe in detail what they show and how they can be used and interpreted in your own analysis.
Motif overview¶
The results start with an overview table of the motifs.

For each motif, it shows the IUPAC sequence, the 0th-order motif representation (PWM) and, if available, the estimated performance of the motif and the fraction of sequences that contain the motif.
The motif and all analyses can be downloaded by clicking the button in the last column.
Sequence logos¶
We developed sequence logos for higher orders to visualise the BaMMs.
For this we split the relative entropy into a sum of terms, one for each order. The logos show the amount of information contributed by each order over and above what is provided by lower orders, for each k-mer and position.
In the 0th-order sequence logo, the height of the four bases on each column is determined by their relative frequencies. More frequent bases are depicted on top of less frequent bases. Consequently, the consensus sequence can be assembled from the top bases, while the vertical order of bases in each column corresponds to their order of predominance.
This 0th-order sequence logo was designed to reflect the characteristics of the PWM model and has been widely used. However, it is not suited to illustrate dependencies between binding site positions.
We therefore also provide higher-order logos. In the higher-order logos, the height of both columns and k-mers corresponds to the contribution to the information content that is not yet described by a lower order, in other words, the information you can gain by taking into account the dependencies between positions in the motif. Note that k-mers can exhibit negative contributions to the information content.

AvRec evaluation¶
Why yet another evaluation metric?¶
Various metrics are used for describing the quality of a motif, most notably p-values, the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision Recall curve (AUPRC). All of these methods have problems capturing the potential biological relevance of the motif. For a more detailed explanation, please refer to [KRG+18].
We sought to develop a motif performance evaluation that
- covers the range of False discovery rates (FDR) most relevant in practical applications
- allows the user to easily estimate the performance of the motif in her particular application
The method¶
We generate background sequences from a second-order Markov model trained on all input sequences and evaluate how well the motif separates positive sequences from the negative (=background) sequences. We define true positives (TP) as correct predictions, false positives (FP) as false predictions, and false negatives (FN) as positive test cases that have not been predicted. The precision is defined as the fraction of predictions that are correct, TP/(TP+FP), and the recall (= sensitivity) is the fraction of true motif instances that are actually predicted, TP/(TP+FN).
The TP/FP ratio is calculated for a positives:negatives ratio of 1:1. We plot the recall vs. TP/FP ratio curve, with the TP/FP ratio plotted on a logarithmic y-scale between 1 and 100.
The area under this curve is the average recall (AvRec) in the regime of relevant FDR values.
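The following sketch shows one plausible reading of this definition. It is not the server's implementation. Given one score per positive and per background sequence (higher meaning more motif-like), it computes recall and the TP/FP ratio at every score threshold and then averages the best achievable recall over \(\log_{10}\) TP/FP between 0 and 2. The function names, the grid resolution and the handling of tied scores are assumptions.

```python
import math


def recall_ratio_curve(pos_scores, neg_scores):
    """Return (recall, TP/FP ratio) pairs, one per distinct score threshold."""
    labelled = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores],
                      key=lambda item: -item[0])
    # Rescale false positives so that the ratio refers to a 1:1 setting.
    scale = len(pos_scores) / len(neg_scores)
    curve, tp, fp = [], 0, 0
    for i, (score, is_positive) in enumerate(labelled):
        tp += is_positive
        fp += 1 - is_positive
        # Emit a point only after all items with the same score are counted.
        if i + 1 < len(labelled) and labelled[i + 1][0] == score:
            continue
        ratio = tp / (fp * scale) if fp else math.inf
        curve.append((tp / len(pos_scores), ratio))
    return curve


def average_recall(curve, n_grid=201):
    """Average recall over log10(TP/FP) from 0 to 2, i.e. ratios 1 to 100."""
    total = 0.0
    for i in range(n_grid):
        min_ratio = 10 ** (2 * i / (n_grid - 1))
        # Best recall reachable while keeping TP/FP at or above min_ratio.
        total += max((rec for rec, ratio in curve if ratio >= min_ratio),
                     default=0.0)
    return total / n_grid


pos = [0.9, 0.8, 0.7, 0.2]    # toy scores of input sequences
neg = [0.6, 0.3, 0.1, 0.05]   # toy scores of background sequences
avrec = average_recall(recall_ratio_curve(pos, neg))
```

In this sketch a motif that separates positives from background perfectly reaches a value of 1, while a motif that scores both sets identically stays close to 0.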
Dataset AvRec vs. motif AvRec¶
We give two different versions of the AvRec score: the dataset AvRec and the motif AvRec. For the dataset AvRec we assume that all sequences of the input set are positive, meaning that all of them are bound. If only a fraction of the sequences carry the motif, the dataset AvRec will severely underestimate the motif performance, as unbound sequences will behave like the background sequences.
For the motif AvRec we estimate the number of sequences carrying the motif with the fdrtool [Str08]. By this strategy, we label as positive only the sequences that carry a motif.
Dataset AvRec and motif AvRec have the same value if all sequences are estimated to carry a motif.
The performance plots¶

We show four performance plots: the distribution of p-values calculated under the ZOOPS model and its respective recall vs. TP/FP ratio curve for the dataset AvRec (left column) and the motif AvRec (right column).
P-value distribution plots¶
The p-values are calculated on the joint set of background sequences and input sequences. The p-value is calibrated such that it is uniform over the background sequences (shown as a gray shaded rectangle). The input sequences will carry the motif more often than random and therefore enrich for low p-values. Input sequences that do not carry a motif have uniformly distributed p-values.
For the motif AvRec analysis the p-value distribution plot also contains an orange dashed line that separates the sequences with motif (above the line) from sequences without motif (below the line). If all input sequences carry the motif the orange line will coincide with the fraction of positive sequences, shown here:

As described above, in this case the dataset AvRec is equal to the motif AvRec.
Recall vs. TP/FP ratio plots¶
The recall vs. TP/FP ratio plots are calculated from the p-value distribution plots. The TP/FP ratio axis is depicted in log scale, ranging from 1 to 100. The TP/FP ratio \(R\) is directly related to the FDR via \(\mathrm{FDR} = FP/(TP+FP) = 1/(1+R)\).
The highest point on the y axis (R=100) therefore relates to an FDR of 1/101, the lowest to an FDR of 0.5.
We generate three blue lines for different ratios of positive to negative sequences. The solid blue line represents the same number of positive and negative sequences (ratio 1:1). We also draw two dashed lines for the ratios 1:10 and 1:100. Note that depending on the motif quality not all of the lines may be visible.
We define the AvRec score as the area under the solid curve (the 1:1 case). It is also given at the top of the plot.
Note: in the recall vs. TP/FP ratio plot, the line representing a positive/negative ratio of 1:10 is the 1:1 curve shifted down by one unit (\(\log_{10}(10) = 1\)). This allows you to estimate the motif performance for your own expected ratio of positives to negatives. All you have to do is shift the 1:1 curve accordingly!
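The two relations used in this section can be written down in a few lines. The FDR values quoted above (1/101 at R = 100 and 0.5 at R = 1) correspond to FDR = 1/(1+R), and moving from a 1:1 to a 1:m ratio of positives to negatives divides the TP/FP ratio by m. The helper names below are hypothetical.

```python
import math


def fdr_from_ratio(ratio):
    """FDR = FP / (TP + FP) = 1 / (1 + TP/FP)."""
    return 1.0 / (1.0 + ratio)


def shift_ratio(ratio_1to1, negatives_per_positive):
    """TP/FP ratio expected when negatives are m times as frequent as positives."""
    return ratio_1to1 / negatives_per_positive


ratio_1to1 = 100.0                        # top of the plotted range
ratio_1to10 = shift_ratio(ratio_1to1, 10)  # same recall, ten times more negatives
shift_in_log_units = math.log10(ratio_1to1) - math.log10(ratio_1to10)
```

For an expected ratio of 1:m, the 1:1 curve is read off after shifting it down by \(\log_{10}(m)\) units on the logarithmic axis.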
Motif distribution plot¶

The motif distribution plot shows the distribution of motif occurrences over the input sequences relative to the middle of the sequences.
In a ChIP-seq experiment primary motifs should have a higher enrichment around the middle of the sequence. Motifs of co-binding factors often show a less clear positional preference.
The plot can be influenced by varying the Motif scan p-value cut-off. When it is set to a low p-value, only highly significant motif positions are used for generating this plot.
Motif-motif comparison¶
Workflows that use Motif-motif compare to annotate motifs with a collection of motifs in our database will produce a result similar to this.

The results are sorted by significance, given by the e-value score. The e-value is the expected number of hits when searching a scrambled motif against the database.
The button in the last column can be used to find detailed information for the motif in our database. From there the motif can be used to scan your sequences for occurrences.
File formats¶
Fasta¶
BaMM webserver accepts sequences in FASTA format. Only nucleotide sequences with the letters A, C, G, T, and N are accepted.
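Before uploading, it can be useful to check a file against this alphabet locally. The sketch below reads FASTA records from an iterable of lines and reports letters outside A, C, G, T and N. The function names are hypothetical, and treating lowercase letters as equivalent to uppercase is an assumption about the server's behaviour.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def invalid_letters(sequence, allowed="ACGTN"):
    """Return the set of letters that are not in the accepted alphabet."""
    # Assumption: case is ignored when checking the alphabet.
    return set(sequence.upper()) - set(allowed)


example = [">seq1", "ACGTN", "ACGT", ">seq2", "ACGU"]
report = {header: invalid_letters(seq) for header, seq in read_fasta(example)}
```

For a file on disk, an open file handle can be passed to `read_fasta` in place of the list.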
BaMM¶
The motif model in BaMM-format is a file with extension .ihbcp (inhomogeneous bamm conditional probability).
It stores the conditional probabilities of the BaMM model for each position. Motif positions are separated by a new line.
Here is an example of BaMM files for a 2nd order motif of length W:
Motif model (extension: .ihbcp)
P_1(A) P_1(C) P_1(G) P_1(T)
P_1(A|A) P_1(C|A) P_1(G|A) P_1(T|A) P_1(A|C) P_1(C|C) ... P_1(T|T)
P_1(A|AA) P_1(C|AA) P_1(G|AA) P_1(T|AA) P_1(A|AC) P_1(C|AC) ... P_1(T|TT)
P_2(A) P_2(C) P_2(G) P_2(T)
P_2(A|A) P_2(C|A) P_2(G|A) P_2(T|A) P_2(A|C) P_2(C|C) ... P_2(T|T)
P_2(A|AA) P_2(C|AA) P_2(G|AA) P_2(T|AA) P_2(A|AC) P_2(C|AC) ... P_2(T|TT)
...
P_W(A) P_W(C) P_W(G) P_W(T)
P_W(A|A) P_W(C|A) P_W(G|A) P_W(T|A) P_W(A|C) P_W(C|C) ... P_W(T|T)
P_W(A|AA) P_W(C|AA) P_W(G|AA) P_W(T|AA) P_W(A|AC) P_W(C|AC) ... P_W(T|TT)
Where P_W(A|CT) is the conditional probability of observing A at motif position W following the context CT.
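As a reading aid for this layout, the sketch below parses such a file into nested lists and looks up a single conditional probability. It assumes that motif positions are separated by an empty line, that each order occupies exactly one line, and that values are ordered with the context as the slower-changing index, as in the example above. Both function names are hypothetical.

```python
def parse_ihbcp(text):
    """Return model[position][order] as a list of conditional probabilities."""
    model, position = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            # An empty line closes the current motif position.
            if position:
                model.append(position)
                position = []
            continue
        position.append([float(value) for value in fields])
    if position:
        model.append(position)
    return model


def lookup(model, position, base, context=""):
    """Return P_position(base | context), with position counted from 0."""
    alphabet = "ACGT"
    index = 0
    # Read context and base as digits of a base-4 number: AA, AC, AG, ...
    for letter in context + base:
        index = index * 4 + alphabet.index(letter)
    return model[position][len(context)][index]


# Toy first-order model with two positions (values are made up).
example_text = """0.1 0.2 0.3 0.4
0.7 0.1 0.1 0.1 0.1 0.7 0.1 0.1 0.1 0.1 0.7 0.1 0.1 0.1 0.1 0.7

0.25 0.25 0.25 0.25
0.4 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.2 0.4
"""
model = parse_ihbcp(example_text)
p_c_given_c = lookup(model, 0, "C", "C")
```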
Background model (extension: .hbcp)
P(A) P(C) P(G) P(T)
P(A|A) P(C|A) P(G|A) P(T|A) P(A|C) P(C|C) ... P(T|T)
P(A|AA) P(C|AA) P(G|AA) P(T|AA) P(A|AC) P(C|AC) ... P(T|TT)
Where P(A|CT) is the conditional probability of observing an A following a CT context. P is trained on the negative sequence set if available. If no negative sequences are provided, P is learnt on the positive set.
MEME¶
PWM models can be uploaded in MEME format. MEME versions lower than version 4 have not been tested and are thus not recommended.
This is an example MEME file generated by PEnG-motif. Note that the lines below MOTIF provide additional annotation and can vary between tools and databases.
MEME version 4
ALPHABET= ACGT
Background letter frequencies
A 0.25864 C 0.240258 G 0.241035 T 0.260067
MOTIF TGASTCATCSC
letter-probability matrix: alength= 4 w= 11 nsites= 32240 bg_prob= 0 opt_bg_order= 2 log(Pval)= -20070.6 zoops_score= 0.763 occur= 0.939
0.00000011 0.00000020 0.00000005 0.99999958
0.00000019 0.00000019 0.99973792 0.00026177
0.99776745 0.00222652 0.00000086 0.00000516
0.00043767 0.31039140 0.68885416 0.00031674
0.00000172 0.00001118 0.00000463 0.99998242
0.00015724 0.99983966 0.00000142 0.00000168
0.99997258 0.00000054 0.00002521 0.00000166
0.00000828 0.25723305 0.00413273 0.73862594
0.02208222 0.92982459 0.00702223 0.04107105
0.16592142 0.34808874 0.35102692 0.13496293
0.07382397 0.51519489 0.17385206 0.23712915
MOTIF ATTRTTTGTTTT
letter-probability matrix: alength= 4 w= 12 nsites= 13728 bg_prob= 0.0 opt_bg_order= 2 log(Pval)= -893.0211792 zoops_score= 0.252 occur= 0.621
0.68648666 0.00624365 0.03511349 0.27215624
0.27477601 0.00371415 0.06135688 0.66015303
0.00009623 0.00107756 0.00017856 0.99864769
0.56885940 0.00127072 0.42884308 0.00102682
0.00040802 0.00205148 0.00037785 0.99716270
0.00159969 0.00023089 0.00047653 0.99769294
0.00016588 0.00006246 0.01915511 0.98061651
0.24886248 0.01232569 0.70805818 0.03075366
0.00018377 0.14974646 0.01920011 0.83086962
0.08978166 0.01159330 0.00815281 0.89047223
0.00074780 0.00028864 0.00068021 0.99828333
0.27042452 0.00127012 0.01194946 0.71635598
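A file like the one above can be read with a few lines of code. The sketch below extracts the probability rows of every motif and relies on the `w=` field of the `letter-probability matrix` line to know how many rows follow. It ignores all other annotation, and the function name is hypothetical.

```python
import re


def parse_meme(text):
    """Return {motif_name: list of [P(A), P(C), P(G), P(T)] rows}."""
    motifs, name, rows_left = {}, None, 0
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("MOTIF"):
            name = line.split()[1]
            motifs[name] = []
        elif line.startswith("letter-probability matrix") and name is not None:
            # Assumption: the motif width is always given as "w= <int>".
            width = re.search(r"\bw=\s*(\d+)", line)
            rows_left = int(width.group(1)) if width else 0
        elif rows_left > 0 and line:
            motifs[name].append([float(value) for value in line.split()])
            rows_left -= 1
    return motifs


example = """MEME version 4
ALPHABET= ACGT
MOTIF TG
letter-probability matrix: alength= 4 w= 2 nsites= 10
0.0 0.0 0.0 1.0
0.0 0.0 1.0 0.0
"""
pwms = parse_meme(example)
```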
Motif occurrence¶
We store motif occurrences in a file with extension .occurrence.
Occurrence files have 7 columns:
- seq
- the sequence identifier in the uploaded fasta file
- length
- the length of the fasta sequence
- strand
- whether the motif was found on the positive (+) or reverse complemented (-) strand.
- start..end
- the relative position of the motif in the sequence
- pattern
- the nucleotide sequence of the motif in the sequence
- p-value
- the estimated p-value of the motif occurrence
- e-value
- the estimated e-value of the motif occurrence
This is an example of an occurrence file:
seq length strand start..end pattern p-value e-value
>chr5:119672047-119672247 209 + 23..31 GGCAGCTGT 0.00045 0.225
>chr9:21950422-21950622 209 + 23..31 AGCAGCTGC 4.78e-05 0.0239
>chr7:6410115-6410315 209 + 101..109 GGCACCTGC 0.0001 0.0502
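For downstream analysis, the seven columns can be loaded into typed records. The sketch below assumes whitespace-separated columns and sequence identifiers without spaces, as in the example above. The class and function names are hypothetical.

```python
from typing import NamedTuple


class Occurrence(NamedTuple):
    seq: str
    length: int
    strand: str
    start: int
    end: int
    pattern: str
    p_value: float
    e_value: float


def parse_occurrences(text):
    """Parse the contents of an occurrence file into Occurrence records."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        # Skip empty lines, malformed lines and the header line.
        if len(fields) != 7 or fields[0] == "seq":
            continue
        start, end = fields[3].split("..")
        records.append(Occurrence(fields[0].lstrip(">"), int(fields[1]), fields[2],
                                  int(start), int(end), fields[4],
                                  float(fields[5]), float(fields[6])))
    return records


example = """seq length strand start..end pattern p-value e-value
>chr5:119672047-119672247 209 + 23..31 GGCAGCTGT 0.00045 0.225
>chr9:21950422-21950622 209 + 23..31 AGCAGCTGC 4.78e-05 0.0239
>chr7:6410115-6410315 209 + 101..109 GGCACCTGC 0.0001 0.0502
"""
occurrences = parse_occurrences(example)
strongest = min(occurrences, key=lambda occ: occ.p_value)
```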
Motif evaluation¶
The motif evaluation scores are stored in a file with extension .bmscore.
*.bmscore files have 6 columns:
- TF
- base name of the sequence data file
- #
- number of the motif
- d_avrec
- data set AvRec score - a score indicating how well the motif can distinguish input sequences from artificially generated sequences
- d_occur
- fraction of sequences with a motif in the data set setting (see explanation above)
- m_avrec
- motif AvRec score - a score indicating how well the motif can distinguish sequences with a motif from artificially generated sequences or input sequences without a motif.
- m_occur
- fraction of sequences with a motif in the input set
You can find a detailed definition and discussion of the AvRec score, and of the difference between dataset AvRec and motif AvRec, in the webserver publication [KRG+18].
This is an example of a .bmscore file for a dataset with three motifs:
TF # d_avrec d_occur m_avrec m_occur
JUN_D 1 0.668 0.552 0.705 0.948
JUN_D 2 0.367 0.328 0.383 0.958
JUN_D 3 0.161 0.874 0.392 0.408
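The score table can be loaded into dictionaries keyed by the column names, for example to pick the motif with the highest motif AvRec. The sketch assumes whitespace-separated columns and a TF name without spaces. The function name is hypothetical.

```python
def parse_bmscore(text):
    """Parse the contents of a .bmscore file into a list of dictionaries."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    records = []
    for row in rows:
        record = dict(zip(header, row))
        record["#"] = int(record["#"])
        for key in ("d_avrec", "d_occur", "m_avrec", "m_occur"):
            record[key] = float(record[key])
        records.append(record)
    return records


example = """TF # d_avrec d_occur m_avrec m_occur
JUN_D 1 0.668 0.552 0.705 0.948
JUN_D 2 0.367 0.328 0.383 0.958
JUN_D 3 0.161 0.874 0.392 0.408
"""
scores = parse_bmscore(example)
best = max(scores, key=lambda record: record["m_avrec"])
```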
FAQ¶
- I think I found a bug, how can I make you aware?
- The best way is to file an issue in our github repository. Additionally you can write an email to bamm@mpibpc.mpg.de. In any case, please provide as much information as possible for us to reproduce the bug, e.g. the link to the result page.
- How long are the results available on the server?
- We guarantee that the results will be accessible via job id for at least 3 months.
- What is the maximum size of files I can upload?
- You can upload files of up to 50 MiB in size. For larger sequence files, you can either use our commandline tools, or run the webserver locally after adapting the MAX_UPLOAD_FILE_SIZE configuration option. You can find detailed instructions in the README in the webserver’s github repository.
How do I prepare my ChIP-seq data?¶
ChIP-seq produces regions in the genome that are bound by the factor of interest. The genome however is full of short repeats that due to their high occurrences and informativeness can easily overwhelm the signal of the true binding motif. Careful preprocessing can be crucial for optimizing the true binding motif.
The following pipeline has so far yielded good results for us.
- Use your favorite peak caller to obtain peaks from non-redundant bound regions
- Rank the sequences by the score obtained for each peak (e.g. q-value)
- Extract fasta sequences of fixed length (e.g. 201) centered on the peak regions
- Submit a fasta file with sequences from the highest ranked peaks (e.g. 5000)
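The ranking and windowing steps above can be sketched as follows. The input is assumed to be a list of `(chromosome, summit, score)` tuples in which a higher score means a stronger peak, so a q-value would first have to be converted, for example to its negative logarithm. The function name, the tuple layout and the zero-based, half-open output intervals are assumptions.

```python
def top_peak_windows(peaks, n_top=5000, width=201):
    """Return fixed-width windows centered on the summits of the best peaks."""
    ranked = sorted(peaks, key=lambda peak: peak[2], reverse=True)[:n_top]
    half = width // 2
    windows = []
    for chrom, summit, _score in ranked:
        start = summit - half
        if start < 0:
            continue  # window would extend past the chromosome start
        windows.append((chrom, start, start + width))
    return windows


# Made-up peaks for illustration.
peaks = [("chr1", 1000, 5.0), ("chr1", 5000, 9.0), ("chr2", 500, 7.0)]
windows = top_peak_windows(peaks, n_top=2, width=201)
```

The resulting intervals can then be turned into a fasta file with a tool of your choice.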
Can I use the server with long sequences (>250bp)?¶
BaMMserver uses a ZOOPS model for learning and evaluating its higher-order models. That means all motifs are trained independently from each other and every sequence in the input file is considered to have either exactly one or no occurrence of the motif.
This setting is optimized for short sequences that are strongly enriched for the motif of interest, e.g. generated by CLIP-,ChIP- or SELEX-based methods. For longer sequences (e.g. scanning full promoter sequences), our ZOOPS model has limitations:
- Low complexity repeat sequences (e.g. ATATATATAT repeats) are abundant in genomes. Repeats will appear as strong motifs, despite having little biological significance for most questions.
- Our evaluation metric AvRec is based on how well the motif can distinguish input sequences from scrambled sequences. In our ZOOPS model only the best motif occurrence per sequence is used for classification. The longer the input sequences the higher the chance of finding a motif by chance.
- This has several implications:
- Despite their strong enrichment, low complexity motifs are often biologically irrelevant.
- All but very long and informative motifs (often low complexity repeats!) score poorly in the AvRec benchmark.
- The following options are possible:
- Use only the seeding stage (Manual seed selection, see De-novo motif discovery), which learns PWMs using a MOPS model. Skip the refinement of the seeds to BaMMs.
- Chop long sequences up into multiple smaller sequences (e.g. 100bp) to get a more robust performance estimation, especially when only few sequences (<1000) are used (see also Can I use the server with very few sequences?).
Can I use the server with very few sequences?¶
You need to have at least 10 sequences - robust performance evaluation requires 100 or more sequences.
The higher-order motif refinement and the motif quality evaluation rely on the ZOOPS (zero or one occurrence per sequence) model. Very long sequences (>500) with more than one motif are best chopped up into smaller sequences before uploading to the higher-order refinement.
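Chopping can be done with a few lines of code before uploading. The sketch below cuts a sequence into consecutive, non-overlapping pieces and drops a short remainder. The function name and both size parameters are illustrative choices. Note that non-overlapping pieces can cut a motif occurrence at a boundary.

```python
def chop(sequence, chunk_size=100, min_size=50):
    """Split a sequence into consecutive non-overlapping chunks."""
    chunks = [sequence[i:i + chunk_size]
              for i in range(0, len(sequence), chunk_size)]
    # Drop a trailing chunk that is too short to be useful.
    return [chunk for chunk in chunks if len(chunk) >= min_size]


pieces = chop("ACGT" * 60)  # a 240 nt toy sequence
piece_lengths = [len(piece) for piece in pieces]
```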
The seeding stage itself uses a MOPS model. It is therefore possible to scan a handful of very long sequences for enriched PWMs. You can circumvent the minimum requirement of sequences for the seeding stage by adding extra sequences with the sequence ‘NNNNNNNNNNNN’. Please note that these seeds cannot be optimized to higher-order models due to the ZOOPS assumption.
How do I figure out whether my motif is biologically relevant?¶
Motif learners find enriched sequence motifs from the input data. However, statistically significant motifs do not necessarily play a role in regulatory processes. De-novo motifs should be analyzed carefully, and regulatory function should not be ascribed without further validation. We offer several ways to help validate the motifs:
Infer relevance from P-value distribution¶
By performing quality control and only selecting the 1000-5000 most strongly bound sequences, true motifs should be present in a significant amount of sequences. Have a look at the p-value statistic for calculating the motif AvRec (upper right plot in evaluation panel).
- There are mainly two things to ensure (see also Motif p-value distribution indicates the relevance of the motif):
- The p-value distribution should be skewed towards low p-values (the more uniform the distribution, the less prevalent or informative the motif)
- There should be a significant portion of area above the orange line (meaning that a significant portion of input sequences carry the motif).
If your motif does not pass the above criteria, try setting stricter cutoffs for selecting the sequences or shortening the sequences. If this does not help, the motif is probably not relevant.
Warning
Using only very few input sequences will increase the noise on the p-value distribution and may make it hard to interpret the plot (see also Can I use the server with very few sequences?)

Motif p-value distribution indicates the relevance of the motif
Infer relevance from motif occurrences¶
When the sequences are generated from signal peaks (e.g. ChIP-seq), there is an additional information source available: when the sequences are extracted symmetrically around the peak, motifs should be enriched around the center.
The centered enrichment of motifs from ChIP-seq sequences can be appreciated in the figure below.

For sequences centered around peaks, motifs with uniform occurrence distributions are less likely to be of relevance.

Infer relevance from motif complexity¶
Low complexity repeat regions are abundant in genomes. Always be careful when interpreting repeat motifs like ACACACAC or TATATATAT. Their high abundance and high information content make them easily reach statistical significance. They are especially prominent when the input sequences are long. The webserver will show a warning if the best scoring motif is a repeat motif.
Infer relevance from MMcompare annotation¶
Motif comparisons generated by our MMcompare tool can be used as a strong indicator that the motif is relevant and a good starting point for deeper investigation of the underlying biology.
Warning
Please do not forget to always be sceptical when assigning proteins and function to your discovered motifs. The motifs can originate from cofactors with strong binding motifs, or repetitive regions.

Where is the button to visualize in the genome browser?¶
The genome browser button is only available if all fasta headers in your input file follow the format seqid:start-end with zero-based coordinates.
For hg38, an example fasta file would look like this:
>chr2:88600218-88600418
TGAAAGCAGATGGAGCTTTTCCTTGAGAGCCACAGAAGCAATATATGCATGCAGTTCAGGTACAGAGATGACATCACCCTTCACAATAGCATTACCTCACCCCCTAAGCATAGGAATGAGTCACCCGATAGTCAGCTGCAAATCTCTTGGTAGAAAAAAATGTAGGTTACGGTGATGCATTTTCACATCCCACTGATTTG
>chr7:37032027-37032227
TTTAAAAATATACTTGTTTGGCTTGATTCAGGCTGCTCCTCATTCCAGGCCTGCGTGAGTCATTGGAGAAACATCCTATTAGAGTGCACCCCTACTGATTGGCTTCCTTTGTATGTTCACGGTGACTCAGAAGAGATGACTCACAGTTCACGCTTATGACAAAAAGAACTTGCTCTCCCTTCCTTTTCATTACCCATGTT
[...]
You can generate an input file from genomic annotation files with the getfasta module of bedtools [QH10][Qui14].
Please refer to the documentation of bedtools for more details.
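To check locally whether the headers of a fasta file follow the seqid:start-end pattern, a regular expression is sufficient. The sketch below is an approximation of the format described above and not the server's actual validation. The pattern, the function name and the requirement that start is smaller than end are assumptions.

```python
import re

# seqid without ':' or whitespace, followed by integer start and end.
HEADER_PATTERN = re.compile(r"^>?(?P<seqid>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_header(header):
    """Return (seqid, start, end), or None if the header does not match."""
    match = HEADER_PATTERN.match(header.strip())
    if match is None:
        return None
    start, end = int(match.group("start")), int(match.group("end"))
    if start >= end:
        return None  # assumption: empty or reversed intervals are rejected
    return match.group("seqid"), start, end


headers = [">chr2:88600218-88600418", ">chr7:37032027-37032227", ">my_sequence_1"]
parsed = [parse_header(header) for header in headers]
all_headers_valid = all(item is not None for item in parsed)
```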
Miscellaneous¶
Glossary¶
- ZOOPS
- Zero or One Occurrence Per Sequence, describes the modeling assumption that each input sequence contains either no motif occurrence or exactly one.
- MOPS
- Multiple Occurrences Per Sequence, describes the modeling assumption that input sequences can contain zero or multiple occurrences of a motif.
- AvRec
- Average Recall, evaluation metric used by the BaMMserver, for details see AvRec evaluation.
- PWM
- Position Weight Matrix, zeroth-order motif model with independent contributions of each motif position. See also PWMs on Wikipedia.
Using the commandline tools¶
The software for both the seeding stage (PEnG-motif) and the refinement stage (BaMMmotif) is available as standalone packages under the GPL license. Please refer to the README files in the github repositories for more details on how to use them.
Setting up the server locally¶
The source code of the server is open source and freely available under the AGPL license. If you intend to set up the server on your own computer, you can find a detailed description in the webserver’s README.
Citing BaMM webserver¶
If you are using BaMM webserver in your research, please cite our webserver [KRG+18] and BaMMmotif [SSoding16] papers, if applicable.
References¶
[KFS+17] | Aziz Khan, Oriol Fornes, Arnaud Stigliani, Marius Gheorghe, Jaime A Castro-Mondragon, Robin van der Lee, Adrien Bessy, Jeanne Chèneby, Shubhada R Kulkarni, Ge Tan, and others. Jaspar 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res., 46(D1):D260–D266, 2017. doi:10.1093/nar/gkx1126. |
[KRG+18] | Anja Kiesel, Christian Roth, Wanwan Ge, Maximilian Wess, Markus Meier, and Johannes Söding. The bamm web server for de-novo motif discovery and regulatory sequence analysis. Nucleic Acids Research, 46(W1):W215–W220, 2018. doi:10.1093/nar/gky431. |
[KVG+17] | Michelle M Kudron, Alec Victorsen, Louis Gevirtzman, LaDeana W Hillier, William W Fisher, Dionne Vafeados, Matt Kirkey, Ann S Hammonds, Jeffery Gersch, Haneen Ammouri, and others. The modern resource: genome-wide binding profiles for hundreds of drosophila and caenorhabditis elegans transcription factors. Genetics, pages genetics–300657, 2017. doi:10.1534/genetics.117.300657. |
[KVY+18] | Ivan V Kulakovskiy, Ilya E Vorontsov, Ivan S Yevshin, Ruslan N Sharipov, Alla D Fedorova, Eugene I Rumynskiy, Yulia A Medvedeva, Arturo Magana-Mora, Vladimir B Bajic, Dmitry A Papatsenko, and others. Hocomoco: towards a complete collection of transcription factor binding models for human and mouse via large-scale chip-seq analysis. Nucleic Acids Res., 46(D1):D252–D259, 2018. doi:10.1093/nar/gkx1106. |
[Qui14] | Aaron R Quinlan. Bedtools: the swiss-army tool for genome feature analysis. Current protocols in bioinformatics, pages 11–12, 2014. doi:10.1002/0471250953.bi1112s47. |
[QH10] | Aaron R Quinlan and Ira M Hall. Bedtools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841–842, 2010. doi:10.1093/bioinformatics/btq033. |
[SSoding16] | Matthias Siebert and Johannes Söding. Bayesian markov models consistently outperform pwms at predicting motifs in nucleotide sequences. Nucleic Acids Res., 44(13):6055–6069, Jul 2016. doi:10.1093/nar/gkw521. |
[Str08] | Korbinian Strimmer. fdrtool: a versatile R package for estimating local and tail area-based false discovery rates. Bioinformatics, 24(12):1461–1462, 2008. doi:10.1093/bioinformatics/btn209. |
[YSV+16] | Ivan Yevshin, Ruslan Sharipov, Tagir Valeev, Alexander Kel, and Fedor Kolpakov. Gtrd: a database of transcription factor binding sites identified by chip-seq experiments. Nucleic Acids Res., pages gkw951, 2016. doi:10.1093/nar/gkw951. |