JASPAR is a collection of transcription factor DNA-binding
preferences, modeled as matrices. These can be converted into Position Weight
Matrices (PWMs or PSSMs),
used for scanning genomic sequences.
JASPAR is the only database with this scope where the data can be
used with no restrictions (open access). For a comprehensive review of models
and how they can be used, please see the following review:
Stormo GD. DNA binding sites: representation and discovery.
Bioinformatics. 2000 Jan;16(1):16-23.
JASPAR is a collection of smaller databases,
which have different goals. Most scientists use JASPAR CORE.
Since JASPAR 4, all matrix models have
versions. This is primarily to keep track of improvements – which can be
anything from correcting typos to actually making a new model based on new
data. Version control works as follows:
IDs are based on a stable ID, and a version number, so that the whole ID
is [stable ID].[version].
The stable ID follows a certain transcription factor, or other logical unit such
as a dimer pair. For instance, the stable ID for the factor GATA1 is MA0035.
However, the GATA1 matrix has been updated with new data, so there are two
versions: MA0035.1 and MA0035.2. By default, only the latest version is shown,
but it is possible to list all versions of a matrix with the same stable ID.
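The ID scheme above can be illustrated with a short Python sketch. The helper names parse_matrix_id and latest_versions are hypothetical (they are not part of any JASPAR API), and MA0048.1 is used only as a second example ID.

```python
def parse_matrix_id(full_id):
    """Split an ID of the form [stable ID].[version] into its two parts."""
    stable_id, _, version = full_id.rpartition(".")
    if not stable_id or not version.isdigit():
        raise ValueError(f"not a versioned matrix ID: {full_id!r}")
    return stable_id, int(version)


def latest_versions(full_ids):
    """Map every stable ID to its full ID with the highest version number."""
    latest = {}
    for full_id in full_ids:
        stable_id, version = parse_matrix_id(full_id)
        if version > latest.get(stable_id, 0):
            latest[stable_id] = version
    return {sid: f"{sid}.{ver}" for sid, ver in latest.items()}


if __name__ == "__main__":
    ids = ["MA0035.1", "MA0035.2", "MA0048.1"]
    print(latest_versions(ids))
```

Keeping the stable ID and the version as separate values makes it easy to show only the latest version by default while still listing all versions on request.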
The
JASPAR CORE database contains a curated,
non-redundant set of
profiles from published articles. All profiles are derived from
published collections of experimentally defined transcription factor binding
sites for multi-cellular eukaryotes. The database represents a curated collection of target sequences. The binding sites
were historically determined either in SELEX experiments, or by the collection
of data from the experimentally determined binding regions of actual regulatory
regions; this distinction is clearly marked in the profiles' annotation. A
number of new high-throughput techniques, like ChIP-seq,
can also be used.
One of the central goals of JASPAR_CORE is to give the single, “best” model for each transcription factor. This
means that the database is non-redundant in the sense that there are not many
models for the same factor (with a few exceptions motivated by biological
complexity, such as dimers and/or splice forms).
The prime differences from similar resources
(TRANSFAC, etc.) are the open data access, non-redundancy and
quality: JASPAR CORE is a smaller set that is non-redundant and curated.
JASPAR_CORE is what most scientists mean when referring to JASPAR
in manuscripts.
For convenience, JASPAR_CORE is divided by larger groups of
species. This distinction is only used in the web interface and, optionally, in
the download section.
What annotation data does each entry hold?
ID | A unique identifier for each model. CORE matrices always have MAnnnn IDs.
Version | The version number of the model; the full ID is [stable ID].[version].
Name | The name of the transcription factor. As far as possible, the name is based on the standardized Entrez gene symbols. If the model describes a transcription factor hetero-dimer, two names are concatenated, such as RXR-VDR. In a few cases, different splice forms of the same gene have different binding specificity: in this case the splice form information is added to the name, based on the relevant literature.
Class | Structural class of the transcription factor, based on the TFCaT system.
Family | Structural sub-class of the transcription factor, based on the TFCaT system.
Species | The species source for the sequences, in Latin. Linked to the NCBI Taxonomic browser. The actual database entries are the NCBI tax IDs; the Latin conversion is only in the web interface.
Tax_group | Group of species, currently consisting of 4 larger groups: vertebrate, insect, plant, chordate.
Acc | A representative protein accession number in GenBank for the transcription factor. Human takes precedence if several exist.
Type | Methodology used for matrix construction (see below).
Medline | A link to the relevant publication reporting the sites used in the model building.
Pazar_tf_id | A link to the PAZAR database.
Comment | For some matrices, a curator comment is added.
When should it be used?
When seeking models for specific factors or structural classes, or
if experimental evidence is paramount.
JASPAR collections are collections of matrices that are useful but
do not fit under the JASPAR CORE scope. Examples include splice forms,
computationally derived patterns with no linked transcription factors, meta-models
etc.
The JASPAR FAM database consists of 11 models describing shared
binding properties of structural classes of transcription factors. These types
of models can be called "familial profiles", "consensus
matrices" or metamodels. The models have two prime benefits:
1) Since many factors have similar target sequences, we often experience multiple
predictions at the same locations that correspond to the same site. These
models reduce the complexity of the results.
2) The models can be used to classify newly derived profiles (or to predict which
structural class the cognate transcription factor belongs to).
The construction of the models is based on the JASPAR CORE database and is described in detail in
Sandelin A, Wasserman WW. Constrained
binding site diversity within families of transcription factors enhances
pattern discovery bioinformatics. J Mol Biol. 2004 Apr 23;338(2):207-15.
A recent, comprehensive study of familial binding profiles and associated
methods is available in the PLoS paper by Mahony et al.
What data does each entry hold?
ID | A unique identifier for each model. FAM matrices always have MFnnnn IDs.
Name | The name of the model. In this database, models were built by first partitioning JASPAR CORE matrices into structural classes; therefore, the names are essentially structural class names.
Medline | The source article (always J Mol Biol. 2004 Apr 23;338(2):207-15).
Included models | The JASPAR CORE matrices used to construct the model.
Type | Always “Metamodel”.
When should it be used?
When searching large genomic sequences with no prior knowledge, and for classification of new user-supplied profiles.
The JASPAR PHYLOFACTS database consists of 174 profiles that were
extracted from phylogenetically conserved gene
upstream elements.
For a detailed description, see Xie et al., Systematic discovery of regulatory
motifs in human promoters and 3' UTRs by comparison of several mammals.
Nature 434, 338-345 (2005) and its supplementary material.
In short, the authors used the following strategy. Promoters
(defined as the 4-kb region around the TSS) of human genes from the RefSeq database were aligned against the genomes of mouse,
rat and dog. Every consensus sequence of length between 6 and 26, defined over
an alphabet of 4 unique (A,C,G,T) and 7 degenerate (R,
Y, K, M, S, W, N) nucleotides, was scanned over the alignments. A motif is
regarded as conserved when it appears in the alignment both for the human and
for the other three mammalian species. The conservation rate p is defined as
the number of times a motif is conserved divided by the number of times it
occurs in man only. This conservation rate is compared to the expected
conservation rate p0, estimated from random motifs, which gives the motif conservation
score MCS. Only motifs with an MCS>6 were retained,
resulting in a list of 174 highly conserved motifs (see supplementary Table S2
of Xie et al.). The count matrices for these 174
motifs were extracted from the downloaded alignments. They were further
annotated according to their resemblance with TRANSFAC and JASPAR CORE motifs.
For TRANSFAC, the annotation of Xie et al. was used.
For comparing to the JASPAR CORE matrices, the Pearson Correlation Coefficient
(PCC) was used to define matrix similarity. All PHYLOFACTS matrices were
scanned against the JASPAR CORE matrices, and matrices were regarded as being
similar when PCC>0.8. When multiple hits were found, only the one with the
highest PCC was retained.
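The matching rule described above (keep only the best hit with PCC above 0.8) can be sketched as follows. This is a minimal illustration, not the procedure actually used for PHYLOFACTS: it assumes both matrices have the same width and are compared cell by cell without any shifting or alignment, which the text does not specify. The function names pearson and best_match are hypothetical.

```python
import math


def pearson(matrix_a, matrix_b):
    """Pearson correlation over all cells of two equal-width matrices.

    Matrices are dicts mapping 'A', 'C', 'G', 'T' to lists of column values.
    """
    xs = [v for base in "ACGT" for v in matrix_a[base]]
    ys = [v for base in "ACGT" for v in matrix_b[base]]
    if len(xs) != len(ys):
        raise ValueError("matrices must have the same width")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant matrix")
    return cov / math.sqrt(var_x * var_y)


def best_match(query, candidates, cutoff=0.8):
    """Return (name, pcc) of the best candidate with pcc > cutoff, else None.

    Assumption: candidates of a different width are skipped, not aligned.
    """
    best = None
    for name, matrix in candidates.items():
        if len(matrix["A"]) != len(query["A"]):
            continue
        pcc = pearson(query, matrix)
        if pcc > cutoff and (best is None or pcc > best[1]):
            best = (name, pcc)
    return best
```

A real comparison would also need a policy for matrices of different widths (for example, sliding the shorter matrix along the longer one); that choice is left open here.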
What data does each entry hold?
ID | A unique identifier for each model. PHYLOFACTS matrices always have PFnnnn IDs.
Name | The name of the model. In this database, models are based on over-represented words which are unique. The name is simply the consensus sequence.
Jaspar | The JASPAR_CORE motif that has the best similarity score when compared to this model. Only hits with a similarity score over 0.8 are considered.
Transfac | The TRANSFAC (public version) motif that has the best similarity score when compared to this model. Only hits with a similarity score over 0.8 are considered.
Sysgroup | Group of species. Always “mammals”.
Type | Always “phylogenetic”.
Medline | The source article (always Nature 434, 338-345 (2005)).
When should it be used?
The JASPAR PHYLOFACTS matrices are a mix of motifs for known and
undefined transcription factors. They are useful when
one expects that other factors might determine promoter characteristics, such
as structural aspects and tissue specificity. They are highly complementary to
the JASPAR CORE matrices, so they are best used in combination with that matrix set.
The deluge of novel data presented recently pertaining to
transcription start sites (reviewed in (13,14)) motivates computational studies
of core promoters. The JASPAR_POLII sub-database holds 13 known DNA patterns
linked to RNA polymerase II core promoters, such as the Inr
and BRE elements, each based on experimental evidence: each model must be
constructed using 5 or more experimentally verified sites. An important
difference from the transcription factor profiles in JASPAR CORE is that patterns
here do not necessarily have a specified protein interactor
(see (15) for a review on core promoter patterns). When possible,
profiles were extended by two nucleotides beyond the core motif. We
consistently report positions relative to the TSS as the positions of the 5' and 3'
edges of the matrix.
What data does each entry hold?
ID | A unique identifier for each model. POLII matrices always have POLnnn IDs.
Name | The reported name of the pattern (not necessarily the binding protein, if this is known).
Species | The species source for the sequences, in Latin. “-” generally signifies that several species were used in the model construction.
Medline | A link to the relevant publication reporting the sites used in the model building.
Start relative to TSS | Reported bias (if any) in position relative to the dominant transcription start site in the promoter. This is counted from the 5' end of the pattern (the left side). As we have added some flanking nucleotides, these are sometimes not the exact numbers shown in the source publications.
End relative to TSS | See above. Distance is counted from the 3' end of the matrix (the right side).
When should it be used?
When analyzing properties of core promoters.
Highly conserved non-coding elements are a distinctive feature of
metazoan genomes. Many of them can be shown to act as long-range enhancers that
drive expression of genes that are themselves regulators of core aspects of
metazoan development and differentiation. Since they act as regulatory inputs,
attempts at deciphering the regulatory content of these elements have started.
JASPAR CNE is a collection of 233 matrix profiles
derived by Xie et al based on clustering of overrepresented motifs
from human conserved non-coding elements. While the biochemical and biological
role of most of these patterns is still unknown, Xie
et al. have shown that the most abundant ones correspond to known DNA-binding
proteins, most notably insulator-binding protein CTCF. These matrix profiles
will be useful for further characterization of regulatory inputs in long-range
developmental gene regulation in vertebrates.
What data does each entry hold?
ID | A unique identifier for each model. CNE matrices always have CNnnnn IDs.
Name | The name of the model.
Consensus sequence | The consensus sequence of the motif; important as it is the basis for clustering over-represented sites in this study.
Medline | The source article (always Xie et al.).
When should it be used?
When analyzing properties of potential enhancers.
This small collection contains matrix profiles of human canonical
and non-canonical splice sites, as matching donor:acceptor pairs. It currently contains only 6 highly
reliable profiles derived from the human genome by Chong
et al. In the future, we shall include
additional eukaryotic species, as well as new models for exonic
splicing enhancers (ESE) and inhibitors (ESI).
What data does each entry hold?
ID | A unique identifier for each model. SPLICE matrices always have SPnnnn IDs.
Name | The name of the model.
Medline | The source article (always Chong et al.).
When should it be used?
When analyzing splice sites and alternative splicing.
All the PBM collections are built using new in vitro techniques based
on k-mer microarrays. PBM matrix models have their
own database which is specialized for the data: UniPROBE.
The PBM collection is the set derived by Badis et al. from binding preferences of 104 mouse
transcription factors. One profile (IRC900814) was excluded because the
transcription factor could not be identified.
What data does each entry hold?
ID | A unique identifier for each model. Matrices in this collection always have PHnnnn IDs.
Name | The name of the model.
Class | Structural class of the transcription factor, based on the TFCaT system.
Family | Structural sub-class of the transcription factor, based on the TFCaT system.
Species | The species source for the sequences, in Latin. Linked to the NCBI Taxonomic browser. The actual database entries are the NCBI tax IDs; the Latin conversion is only in the web interface.
Tax_group | Group of species, currently consisting of 4 larger groups: vertebrate, insect, plant, chordate.
Medline | A link to the relevant publication reporting the sites used in the model building.
Type | Methodology used for matrix construction.
Comment | For some matrices, a curator comment is added.
When should it be used?
Where it is important that each matrix was
derived using the same protocol.
This PBM collection is the set derived by Berger
et al., comprising 176 profiles from mouse homeodomains.
What data does each entry hold?
ID | A unique identifier for each model. Matrices in this collection always have PHnnnn IDs.
Name | The name of the model.
Class | Structural class of the transcription factor, based on the TFCaT system.
Family | Structural sub-class of the transcription factor, based on the TFCaT system.
Species | The species source for the sequences, in Latin. Linked to the NCBI Taxonomic browser. The actual database entries are the NCBI tax IDs; the Latin conversion is only in the web interface.
Tax_group | Group of species, currently consisting of 4 larger groups: vertebrate, insect, plant, chordate.
Medline | A link to the relevant publication reporting the sites used in the model building.
Type | Methodology used for matrix construction.
Comment | For some matrices, a curator comment is added.
When should it be used?
Where it is important that each matrix was derived
using the same protocol, focused on homeobox factors.
The PBM HLH collection is the set derived by
Grove
et al. It holds 19 C. elegans bHLH transcription factor models.
What data does each entry hold?
ID | A unique identifier for each model. Matrices in this collection always have PHnnnn IDs.
Name | The name of the model.
Class | Structural class of the transcription factor, based on the TFCaT system.
Family | Structural sub-class of the transcription factor, based on the TFCaT system.
Species | The species source for the sequences, in Latin. Linked to the NCBI Taxonomic browser. The actual database entries are the NCBI tax IDs; the Latin conversion is only in the web interface.
Tax_group | Group of species, currently consisting of 4 larger groups: vertebrate, insect, plant, chordate.
Medline | A link to the relevant publication reporting the sites used in the model building.
Type | Methodology used for matrix construction.
Comment | For some matrices, a curator comment is added.
When should it be used?
Where it is important that each matrix was
derived using the same protocol, focused on bHLH
factors.
The JASPAR database can now be reached remotely through a new Web
Service interface. Current functionality includes retrieval of profiles by
name, by identifier and by searching profile annotations. Profiles can be
retrieved as position frequency matrices, position weight matrices or
information content matrices. The purpose of providing an external application
programming interface (API) is to simplify the use of JASPAR in
distributed applications and in scientific workflows created in workflow
editors like Triana, BPEL or Taverna. Other benefits include platform- and
language-independent access, as well as constantly up-to-date access to the
database. The API is implemented as a WS-I compliant Web service,
identical to the technology used for the services made available through the
EMBRACE Network of Excellence, and the Web service technology chosen by the
European Bioinformatics Institute (EBI). The service is described by a WSDL
file, which also contains further information about the Web service, including
example clients in Java and Python.
JASPAR has a fully fledged API in the Perl programming language: TFBS.pm,
which interacts with the MySQL version of JASPAR. The
API can be used for a large number of tasks, including searching
sequences and alignments. See the
TFBS page and the TFBS::DB::JASPAR5 section in the
manual. JASPAR5.pm is very similar
to JASPAR4 in terms of methods, but is built on another data model ‘under the
hood’.
JASPAR is downloadable in two different formats,
from the DOWNLOAD link on the start page:
i) flat files resulting from the TFBS::DB::MatrixDir function in the Perl
API, which are easily parsable
ii) flat files corresponding to the MySQL tables used
internally in the database. The CREATE TABLE statements are available as well.
Downloading sites and alignments
In the DOWNLOAD
directory, most matrix collections have a SITE subdirectory, which for each
model lists all sites used for the model construction as a fasta file. The alignments are implicit – the used
sub-parts of sequences are in capitals. Note that in the majority of cases,
this is an interpretation – we use pattern finders to find the most likely
alignment, but this might not always be the most correct. This is the principal
reason we make these collections available – users can make their own models
based on the raw files.
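As a rough sketch of how a user could build a model from such a file: the code below reads FASTA text, keeps only the capitalised part of each sequence, and tallies a count matrix. It assumes the capitalised sub-parts all have the same length and use only A, C, G and T; the function names and the two example records are made up for illustration.

```python
def aligned_sites(fasta_text):
    """Return the capitalised (aligned) sub-part of every FASTA record."""
    records, current = [], None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            current = []          # start a new record at each header
            records.append(current)
        elif line and current is not None:
            current.append(line)
    sites = []
    for parts in records:
        # Lower-case flanks are dropped; only upper-case bases are kept.
        site = "".join(ch for ch in "".join(parts) if ch in "ACGT")
        if site:
            sites.append(site)
    return sites


def count_matrix(sites):
    """Count each base per column; rows are returned in A, C, G, T order."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("aligned sites must all have the same length")
    return {
        base: [sum(site[i] == base for site in sites) for i in range(width)]
        for base in "ACGT"
    }


if __name__ == "__main__":
    example = ">site1\nacgtCACGTGttt\n>site2\nggCACCTGaa\n"
    print(count_matrix(aligned_sites(example)))
```

Because the alignment is only an interpretation, a user who disagrees with it can re-align the full sequences with another pattern finder and rebuild the counts in the same way.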
The start page has four major “tabs” that
determine the way in which you will interact
with the respective JASPAR database. First, use the “SELECT A JASPAR
SUB-DATABASE” tab (this will also give a brief
summary of the respective database). After this, you can use one of the following:
BROWSE tab: The whole selected
database will be shown (see below for information about the browse
page), sorted by the selected attribute (default attribute is ID). As many
users are interested in the JASPAR_CORE database, you can click “Browse the
JASPAR_CORE database right away” just under the JASPAR image.
SEARCH BY tab:
to select subsets of profiles using user-set criteria, use the search
by fields.
Multiple inputs are acceptable if separated
by commas ',' (this will in effect be interpreted as an OR
statement).
The criteria will be interpreted from top to
bottom by using the boolean
statement at each row:
an AND statement will perform an intersection of two query results
an OR statement will perform a union of two query results
a NOT statement will filter out matching results from the first query
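The top-to-bottom evaluation can be pictured with Python sets. This is only a sketch of the described logic, not the web interface's actual code; split_terms and combine are hypothetical names, and the numeric IDs in the usage example are placeholders.

```python
def split_terms(field_value):
    """Split a comma-separated input; the terms are treated as alternatives (OR)."""
    return [term.strip() for term in field_value.split(",") if term.strip()]


def combine(rows):
    """Combine per-row query results from top to bottom.

    rows is a list of (operator, result_set) pairs. The operator of the
    first row is ignored, since there is nothing above it to combine with.
    """
    result = set(rows[0][1])
    for operator, ids in rows[1:]:
        if operator == "AND":
            result &= ids      # intersection of the two results
        elif operator == "OR":
            result |= ids      # union of the two results
        elif operator == "NOT":
            result -= ids      # remove matches from the running result
        else:
            raise ValueError(f"unknown operator: {operator!r}")
    return result


if __name__ == "__main__":
    rows = [("", {1, 2, 3}), ("AND", {2, 3, 4}), ("NOT", {3}), ("OR", {9})]
    print(sorted(combine(rows)))
```

Since rows are applied strictly in order, reordering them can change the outcome: a NOT row only removes what has been collected above it.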
ALIGN to custom matrix tab: In some cases, it
is beneficial to assess similarity to input data (as with using BLAST for
sequence data comparison when using GenBank). The
input profile can consist of actual counts or be normalized (each column sums
to 1). Log-odds matrices should be avoided. For an example of the input format,
press the “fill in an example matrix” button. The A [ ] etc.
characters are optional, so:
A [13 13 3 1 54 1 1 1 0 3 2 5 ]
C [13 39 5 53 0 1 50 1 0 37 0 17 ]
G [17 2 37 0 0 52 3 0 53 8 37 12 ]
T [11 0 9 0 0 0 0 52 1 6 15 20 ]
is equivalent to
13 13 3 1 54 1 1 1 0 3 2 5
13 39 5 53 0 1 50 1 0 37 0 17
17 2 37 0 0 52 3 0 53 8 37 12
11 0 9 0 0 0 0 52 1 6 15 20
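A small parser shows why the two input forms are equivalent: once the optional row labels and brackets are ignored, only the numbers remain. This sketch assumes one matrix row per line in A, C, G, T order; parse_matrix is a hypothetical helper, not the parser used by the web interface.

```python
import re


def parse_matrix(text):
    """Parse four rows of numbers (A, C, G, T); labels and brackets are optional."""
    rows = []
    for line in text.splitlines():
        # Keep only the numbers, so 'A [13 13 ]' and '13 13' parse identically.
        values = [float(tok) for tok in re.findall(r"\d+(?:\.\d+)?", line)]
        if values:
            rows.append(values)
    if len(rows) != 4 or len({len(row) for row in rows}) != 1:
        raise ValueError("expected four rows of equal length (A, C, G, T)")
    return dict(zip("ACGT", rows))


if __name__ == "__main__":
    labelled = (
        "A [13 13 3 1 54 1 1 1 0 3 2 5 ]\n"
        "C [13 39 5 53 0 1 50 1 0 37 0 17 ]\n"
        "G [17 2 37 0 0 52 3 0 53 8 37 12 ]\n"
        "T [11 0 9 0 0 0 0 52 1 6 15 20 ]\n"
    )
    print(parse_matrix(labelled)["A"])
```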
All profiles in the selected database will be
compared to the input profile, using a modified Needleman-Wunsch
algorithm described in
Sandelin A, Hoglund A, Lenhard B, Wasserman WW. Integrated analysis of
yeast regulatory sequences for biologically linked clusters of genes. Funct Integr Genomics. 2003 Jul;3(3):125-34,
and sorted by raw comparison score
(for reference, the maximum score is 2 * the width of the smallest matrix in the
compared pair). Both the score and the fraction of the potential maximal score are
reported.
This page presents a list of selected
profiles. At the top, it is possible to select a
subset of the matrices, much like on the start page described above.
Note that the columns of this page will
differ between databases, but the ID and Name attributes will always be shown, as
well as a sequence logo for each pattern. For detailed information regarding
any profile model, press the view link: this will give a pop-up window with
detailed matrix information (see below).
At the right-hand side, a number of
functional analyses can be made with selected profiles (see the selection
field in the left-most column). Currently, it is possible to i) cluster selected matrices into familial binding profiles
using the STAMP tool, ii) create permuted matrices using column shuffling of the
selected matrices, iii) create random matrices
using a Bayesian sampling procedure, iv) perform basic sequence analysis
(scanning an input sequence with matrices).
These features are described in detail below,
in the section on extended functionality.
These pop-up pages are the result of clicking on
the logos of chosen models in the browse page, and show detailed information
about the model: annotation data (which differs between
databases; see the respective database entry above), a sequence logo, a count
matrix and hits/bp statistics.
Sequence logos are graphical
representations of the matrix model, based on information content. The
information content of a matrix column ranges from 0 (no base preference) to 2
bits (only 1 base used). A sequence logo is basically a bar plot
showing the total information content in each position, where the bar is
replaced by stacked letters (A,C,G,T), which are sized
and sorted relative to their occurrence. See Schneider
et al. for
a more comprehensive description.
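The 0 to 2 range follows from the definition of information content as 2 minus the Shannon entropy (in bits) of the base frequencies in a column. The sketch below computes it for one column of counts and derives the letter heights for a logo; it deliberately omits the small-sample correction described by Schneider et al., and the function names are illustrative only.

```python
import math


def column_information(counts):
    """Information content (bits) of one column given counts for A, C, G, T."""
    total = sum(counts)
    entropy = -sum(
        (c / total) * math.log2(c / total) for c in counts if c > 0
    )
    return 2.0 - entropy  # log2(4) = 2 bits is the maximum for DNA


def letter_heights(counts):
    """Height of each letter in the logo stack: frequency times information."""
    total = sum(counts)
    info = column_information(counts)
    return {base: (c / total) * info for base, c in zip("ACGT", counts)}


if __name__ == "__main__":
    print(column_information([10, 10, 10, 10]))  # no base preference
    print(column_information([54, 0, 0, 0]))     # only one base used
    print(letter_heights([13, 13, 17, 11]))
```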
The “Make SVG” button gives an
SVG version of the logo suitable for publication images (SVG is a vector format
that does not have the pixel edges of the .png files used in browsers; it can be read by many
drawing programs and most web browsers with proper plugins).
The count matrix is the
underlying model showing the DNA pattern. In
most databases, the cell numbers indicate the number of sequences having base x
in column y. These matrices can be used for a number of different analyses,
including site searching, if suitably converted. See Wasserman and
Sandelin for a review.
The reverse complement button
makes a reverse complement version of the matrix (as DNA is double-stranded,
the two models are functionally equivalent). If the matrix
is reverse-complemented, the logo will change accordingly.
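Reverse-complementing a matrix amounts to swapping the A and T rows, swapping the C and G rows, and reversing the column order. A minimal sketch, using the example count matrix shown earlier on this page (reverse_complement is a hypothetical helper name):

```python
def reverse_complement(matrix):
    """Return the reverse complement of a matrix given as base -> column list."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    # The new row for a base is the reversed row of its complementary base.
    return {base: matrix[complement[base]][::-1] for base in "ACGT"}


if __name__ == "__main__":
    matrix = {
        "A": [13, 13, 3, 1, 54, 1, 1, 1, 0, 3, 2, 5],
        "C": [13, 39, 5, 53, 0, 1, 50, 1, 0, 37, 0, 17],
        "G": [17, 2, 37, 0, 0, 52, 3, 0, 53, 8, 37, 12],
        "T": [11, 0, 9, 0, 0, 0, 0, 52, 1, 6, 15, 20],
    }
    print(reverse_complement(matrix)["A"])
```

Applying the operation twice returns the original matrix, which is a quick sanity check for any implementation.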
For some transcription factors, there are
multiple models; usually this is due to new data becoming available. Clicking
on this link gives a listing of all the models for the
factor in question.
In order to visualize
the binding properties of each JASPAR matrix we calculate the average number of
hits per 1000 base pairs on three distinctly different sequence sets. We do
this by converting the count matrix to a log-odds matrix using a uniform
background model over the four bases. For a series of threshold values ([1,
0.95, ... , 0.65, 0.60]) of the scoring range of the
log-odds matrix we count the number of hits equal to or greater than the current
threshold. We count the number of hits treating each sequence set as one string
and then convert this number to a mean value per 1000 base pairs on both
strands, that is, we search both the leading strand and the reverse complement.
All means are for practical purposes rounded to one decimal.
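The procedure above can be sketched in Python as follows. Several details are assumptions rather than documented behaviour: a pseudocount of 1 per base is added before taking logs (to avoid log of zero), the threshold is read as a fraction of the range between the lowest and highest possible score, and the hit count from both strands is divided by the sequence length. All function names are illustrative.

```python
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def log_odds(counts, pseudocount=1.0):
    """Convert a count matrix to log-odds with a uniform (0.25) background."""
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2((counts[b][i] + pseudocount) / total / 0.25)
            for b in "ACGT"
        })
    return pwm


def hits_per_kb(pwm, sequence, threshold):
    """Mean hits per 1000 bp, scanning the sequence and its reverse complement."""
    lowest = sum(min(col.values()) for col in pwm)
    highest = sum(max(col.values()) for col in pwm)
    cutoff = lowest + threshold * (highest - lowest)
    width = len(pwm)
    hits = 0
    for strand in (sequence, sequence.translate(_COMPLEMENT)[::-1]):
        for start in range(len(strand) - width + 1):
            window = strand[start:start + width]
            if any(ch not in "ACGT" for ch in window):
                continue  # skip windows with ambiguous bases
            score = sum(col[ch] for col, ch in zip(pwm, window))
            if score >= cutoff - 1e-9:  # tolerance for float rounding
                hits += 1
    return round(1000.0 * hits / len(sequence), 1)


if __name__ == "__main__":
    counts = {"A": [10, 0, 0], "C": [0, 10, 0], "G": [0, 0, 10], "T": [0, 0, 0]}
    pwm = log_odds(counts)
    for t in (1.0, 0.8, 0.6):
        print(t, hits_per_kb(pwm, "TTTACGATTT", t))
```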
We use three distinct sequence sets: known promoters, CpG islands and random DNA. The known
promoters consist of all plant, arthropod and vertebrate promoters in the -1000
to +100 region from the EPD database
[ref 1]. This sequence set totals 4735 promoters concatenated into one string.
The CpG sequence set consists of all regions from the
UCSC genome browser
(hg18) with an epigenetic score above 0.5 (see Bock
et al.). This totals 8,559,418 nucleotides.
Finally, the random DNA sequences are randomly picked 1000 base pair windows
from hg18 across all chromosomes, totalling 8,000,000 nucleotides. The randomly
picked DNA is not repeat-masked or in any way filtered.
Using a subset of profiles, a
submitted sequence can be analyzed. Sensitivity and specificity will be
affected by the relative score threshold, by default 80% (see Wasserman
and Sandelin for a review on scoring of matrices to sequences).
This is the most basic form of sequence analysis: dedicated systems
such as ConSite are preferable for anything more than a casual
analysis.
The CLUSTER button provides the user with a means of investigating the relationship between the various matrices. This functionality is provided by the STAMP tool available as a webservice at http://www.benoslab.pitt.edu/stamp/
1. PERMUTATION
This option simply shuffles the columns in matrices. This can be done either by shuffling columns within each
selected matrix, or by shuffling columns among
all selected matrices.
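Both shuffling modes can be sketched as below. This is an illustration of the idea, not the server's implementation: matrices are dicts mapping each base to a list of column values, the function names are hypothetical, and when shuffling among matrices each output is assumed to keep the width of the corresponding input.

```python
import random


def columns(matrix):
    """Return the matrix as a list of (A, C, G, T) column tuples."""
    return list(zip(*(matrix[base] for base in "ACGT")))


def from_columns(cols):
    """Rebuild a base -> row dict from a list of (A, C, G, T) columns."""
    return {base: [col[i] for col in cols] for i, base in enumerate("ACGT")}


def shuffle_within(matrices, rng):
    """Shuffle the columns of every matrix independently."""
    shuffled = []
    for matrix in matrices:
        cols = columns(matrix)
        rng.shuffle(cols)
        shuffled.append(from_columns(cols))
    return shuffled


def shuffle_among(matrices, rng):
    """Pool all columns, shuffle the pool, and deal them back out by width."""
    pool = [col for matrix in matrices for col in columns(matrix)]
    rng.shuffle(pool)
    shuffled, position = [], 0
    for matrix in matrices:
        width = len(matrix["A"])
        shuffled.append(from_columns(pool[position:position + width]))
        position += width
    return shuffled


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    m1 = {"A": [1, 2, 3], "C": [4, 5, 6], "G": [7, 8, 9], "T": [10, 11, 12]}
    m2 = {"A": [0, 1], "C": [1, 0], "G": [2, 2], "T": [3, 3]}
    print(shuffle_within([m1, m2], rng))
    print(shuffle_among([m1, m2], rng))
```

Shuffling whole columns keeps each column's base composition intact, so the permuted matrices retain the per-position information content of the originals.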
2. SAMPLING
This feature of the database enables users to generate random
Position Frequency Matrices (PFMs) from selected
profiles.
We assume that each column in the profile is independent and described by a
mixture of Dirichlet multinomials,
in which the letters are drawn from a multinomial and the multinomial
parameters are drawn from a mixture of Dirichlets.
Within this model each column has its own set of multinomial parameters, but the
higher-level parameters (those of the mixture prior)
are assumed to be common to all JASPAR matrices. We
can therefore use a maximum likelihood approach to learn these from the
observed column counts of all JASPAR matrices. The
maximum likelihood approach automatically ensures that each matrix receives a
weight relative to the number of counts it contains.
Drawing samples from the prior distribution will generate PWMs
with the same statistical properties as the JASPAR
matrices as a whole. PWMs with statistical properties
like those of the selected profiles can be obtained by drawing from a posterior
distribution which is proportional to the prior times a multinomial likelihood
term with counts taken from one of the columns of the selected profiles.
Each 4-dimensional column is sampled by the following three-step procedure: 1) draw the mixture component according to the distribution of
mixing proportions, 2) draw an input column randomly
from the concatenated selected profiles and 3) draw
the probability vector over nucleotides from a 4-dimensional Dirichlet distribution. The parameter vector alpha of the Dirichlet is equal to the sum of the counts (of the drawn
input column) and the parameters of the Dirichlet prior (of the drawn component).
Draws from a Dirichlet can be obtained in the
following way from independent Gamma-distributed samples Yi ~ Gamma(shape = αi, scale = 1):
(X1,X2,X3,X4) = (Y1/V,Y2/V,Y3/V,Y4/V) ~
Dir(α1,α2,α3,α4)
where V = sum(Yi) ~ Gamma(shape = sum(αi), scale = 1).
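The three-step procedure and the Gamma construction can be sketched with Python's standard library. The mixture used in the example (two components with made-up weights and prior parameters) is purely hypothetical; the real parameters are learned from all JASPAR matrices as described above. Function names are illustrative.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalising independent Gamma(alpha_i, 1) draws."""
    ys = [rng.gammavariate(a, 1.0) for a in alpha]
    v = sum(ys)
    return [y / v for y in ys]


def sample_column(selected_columns, mixture, rng):
    """Sample one probability column (A, C, G, T) following the three steps.

    selected_columns: list of count columns from the selected profiles.
    mixture: list of (mixing_proportion, prior_alpha) pairs (hypothetical values).
    """
    # Step 1: draw a mixture component according to the mixing proportions.
    weights = [weight for weight, _ in mixture]
    _, prior_alpha = rng.choices(mixture, weights=weights, k=1)[0]
    # Step 2: draw an input column at random from the selected profiles.
    counts = rng.choice(selected_columns)
    # Step 3: draw from a Dirichlet with alpha = counts + prior parameters.
    alpha = [c + p for c, p in zip(counts, prior_alpha)]
    return sample_dirichlet(alpha, rng)


if __name__ == "__main__":
    rng = random.Random(1)
    mixture = [(0.5, [1.0, 1.0, 1.0, 1.0]), (0.5, [0.5, 0.5, 0.5, 0.5])]
    selected = [[13, 13, 17, 11], [54, 0, 0, 0]]
    print(sample_column(selected, mixture, rng))
```

Repeating sample_column once per position yields a random matrix of any desired width.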
3. OUTPUT FORMATS
For both permutation and random generation of matrices you have the choice between three
different output formats:
Raw - Each matrix is preceded by a FASTA-like header starting with the > symbol and then a matrix ID. The counts for
each base (ACGT) are specified on their own space-separated line, where each
element corresponds to one column. The order of the lines for the bases is A, C, G and finally T.
13 13 3 1 54 1 1 1 0 3 2 5
13 39 5 53 0 1 50 1 0 37 0 17
17 2 37 0 0 52 3 0 53 8 37 12
11 0 9 0 0 0 0 52 1 6 15 20
JASPAR - This is similar to the raw format, having an identical
header. The line for each base, however, starts with a label for the nucleotide
(A, C, G or T) and then the columns follow enclosed in
brackets: [].
A [13 13 3 1 54 1 1 1 0 3 2 5 ]
C [13 39 5 53 0 1 50 1 0 37 0 17 ]
G [17 2 37 0 0 52 3 0 53 8 37 12 ]
T [11 0 9 0 0 0 0 52 1 6 15 20 ]
TRANSFAC - This is a TRANSFAC-like format having a header starting
with "DE", then the matrix ID, the matrix name and the matrix class.
The data itself is transposed compared to the other formats, meaning that
each line corresponds to a column in the matrix. The column lines start with a
number denoting the column index (counting from 0).
After that follow tab-separated counts for each base in that column in the
order A, C, G and T. After the lines
with the counts follows a final line containing the string "XX".
DE MA0048 NHLH1 bHLH
00 13 13 17 11
01 13 39 2 0
02 3 5 37 9
03 1 53 0 0
04 54 0 0 0
05 1 1 52 0
06 1 50 3 0
07 1 1 0 52
08 0 0 53 1
09 3 37 8 6
10 2 0 37 15
11 5 17 12 20
XX
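The three layouts can be produced from one in-memory matrix, which makes the row/column transposition of the TRANSFAC-like format explicit. This sketch follows the descriptions above; the writer function names are hypothetical, and the use of tabs between all fields of the TRANSFAC-like output (including the header line) is an assumption.

```python
def to_raw(matrix_id, matrix):
    """Raw format: '>' header, then one space-separated line per base (A, C, G, T)."""
    lines = [f">{matrix_id}"]
    lines += [" ".join(str(v) for v in matrix[base]) for base in "ACGT"]
    return "\n".join(lines)


def to_jaspar(matrix_id, matrix):
    """JASPAR format: same header, each line labelled and enclosed in brackets."""
    lines = [f">{matrix_id}"]
    lines += [
        f"{base} [{' '.join(str(v) for v in matrix[base])} ]" for base in "ACGT"
    ]
    return "\n".join(lines)


def to_transfac_like(matrix_id, name, tf_class, matrix):
    """TRANSFAC-like format: one line per column, index counted from 0."""
    lines = ["\t".join(["DE", matrix_id, name, tf_class])]
    for i in range(len(matrix["A"])):
        counts = "\t".join(str(matrix[base][i]) for base in "ACGT")
        lines.append(f"{i:02d}\t{counts}")
    lines.append("XX")
    return "\n".join(lines)


if __name__ == "__main__":
    nhlh1 = {
        "A": [13, 13, 3, 1, 54, 1, 1, 1, 0, 3, 2, 5],
        "C": [13, 39, 5, 53, 0, 1, 50, 1, 0, 37, 0, 17],
        "G": [17, 2, 37, 0, 0, 52, 3, 0, 53, 8, 37, 12],
        "T": [11, 0, 9, 0, 0, 0, 0, 52, 1, 6, 15, 20],
    }
    print(to_transfac_like("MA0048", "NHLH1", "bHLH", nhlh1))
```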
a. How do I cite JASPAR?
It depends on what you have used it for. If you simply want to acknowledge that you used the
latest version, use
Bryne JC, Valen E, Tang MH, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A.
Nucleic Acids Res. 2008 Jan;36(Database issue):D102-6.
Otherwise:
The original JASPAR paper:
Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B.
JASPAR: an open-access database for eukaryotic transcription
factor binding profiles.
Nucleic Acids Res. 2004 Jan 1;32(Database issue):D91-4.
The first extension (JASPAR FAM and PHYLOFACTS collections):
Vlieghe D, Sandelin A, De Bleser PJ, Vleminckx K, Wasserman WW, van Roy F, Lenhard B.
A new generation of JASPAR, the open-access repository for
transcription factor binding site profiles.
Nucleic Acids Res. 2006 Jan 1;34(Database issue):D95-7.
Second expansion (POLII, SPLICE, CNE, many changes in the web
service including matrix permutations):
Bryne JC, Valen E, Tang MH, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A.
Nucleic Acids Res. 2008 Jan;36(Database issue):D102-6.
Third expansion (large expansion of the
CORE database, including yeast and worm matrices; also includes new PBM
collections):
JASPAR, the open access database of
transcription factor binding profiles: new and improved content in the 2010
update.
Nucleic Acids Res. 2010 Jan; under review
b. Why are certain sequences not downloadable from JASPAR_CORE?
This is due to historical reasons. JASPAR_CORE was originally built in order to create
familial binding profiles for as many structural classes of transcription
factors as possible. In some experimental literature, only matrices and
not sequences are available. For this project, we were forced to include some
matrices to gain coverage of certain binding site classes. For recent
additions, it is a requirement to have the sequences available.
d. Why is my matrix study not included in JASPAR_CORE?
There are two principal explanations.
The most likely is that we were not aware of your work: please let us
know!
The other possible reason is that the publication did not live up
to the demands of the curators. As we have human curation
of all JASPAR CORE matrices, this is to some degree an arbitrary call, and we are
happy to discuss it with you.
e. Linking web services to CPU-intensive services within JASPAR
We appreciate that other services want to link to JASPAR.
However, if you are using the CPU-intensive services
(matrix comparison, randomization or clustering), please ask the maintainers
(see contact information below) before you do this; otherwise your server
might be rejected without warning. In that case, we strongly suggest setting up a local JASPAR database, as the
database and resources are freely available.
f. Who is JASPAR anyhow?
JASPAR was originally the name of an algorithm for comparing matrix profiles,
developed in a master's student project; it is an obscure tribute to an even more
obscure dialog from the Black Adder episode “The Black Seal” between the Seven
Most Evil Men in the Kingdom:
- …and with all haste, we will meet at Old Jaspar’s tavern
- How is old Jaspar these days?
- Dead.
- How?
- I killed him.
[Loud cheer].
We appreciate feedback: criticism as well as
suggestions for new content. Development and supervision of
the JASPAR project is coordinated by Albin Sandelin
and Boris Lenhard.
Albin Sandelin: albin AT binf.ku.dk
Boris Lenhard: Boris.Lenhard AT bccs.uib.no