We originally generated a gene list for the project by selecting genes using GO terms from TAIR6. This list contained a total of 6400 candidate genes. Because we require detailed knowledge of whether a protein is a membrane protein, and especially of its topology, we initiated a careful topology prediction for the Arabidopsis membrane proteins: proteins were categorized into families, and the topology predictions were then evaluated against known structures of homologs of these membrane proteins. We also unified the topology predictions within complete families. In the course of this work we noticed annotation errors. An extensive re-annotation therefore had to be carried out, which will be implemented in both Aramemnon and TAIR.
A total of 26922 protein sequences from Arabidopsis were clustered using the Markov Cluster algorithm (MCL, http://micans.org/mcl/). Only protein sequences corresponding to the .1 gene models from TAIR7 were used. The clustering procedure resulted in 3360 clusters containing 21807 proteins, plus 5115 singletons.
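The cluster and singleton counts above can be tallied directly from a clustering result. The following is a minimal sketch, assuming the plain-text output that MCL writes in label mode (one cluster per line, members separated by tabs). The function name and the member labels are placeholders for illustration and are not taken from the project.

```python
def summarize_clusters(lines):
    """Return (clusters, proteins in clusters, singletons).

    Each non-empty line is read as one cluster whose members are
    separated by whitespace (assumed MCL label-mode output).
    """
    n_clusters = n_clustered = n_singletons = 0
    for line in lines:
        members = line.split()
        if not members:
            continue  # skip blank lines
        if len(members) == 1:
            n_singletons += 1
        else:
            n_clusters += 1
            n_clustered += len(members)
    return n_clusters, n_clustered, n_singletons


# Placeholder labels, not real gene models.
example = [
    "geneA.1\tgeneB.1\tgeneC.1",
    "geneD.1\tgeneE.1",
    "geneF.1",
]
print(summarize_clusters(example))  # (2, 5, 1)

# Consistency check on the reported totals: clustered + singletons = input.
assert 21807 + 5115 == 26922
```

With a real result file, the same function can be called on the open file object, since iterating over a text file yields its lines.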
The composition of each cluster was analyzed using multiple sequence alignments. Any protein that clearly lacked overall sequence similarity to the other members of its cluster was removed. This ensured that each cluster contained only proteins of the same family, and none that merely shared a local domain with the other members.
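The inspection described above was carried out on the alignments themselves. As a rough illustration of the idea only, and not the procedure used in the project, a first pass could flag cluster members that cover only a small part of the alignment. The sketch below assumes aligned sequences of equal length with '-' as the gap character; the function name, the 0.5 threshold, and the toy sequences are all assumptions.

```python
def low_coverage_members(alignment, min_coverage=0.5):
    """Flag sequences that cover few of the well-occupied alignment columns.

    alignment: dict mapping a member name to its aligned sequence.
    A column counts as "core" when more than half of the sequences
    have a residue (not a gap) at that position.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    core = [i for i in range(length)
            if sum(s[i] != "-" for s in seqs) * 2 > len(seqs)]
    if not core:
        return []

    flagged = []
    for name, s in zip(names, seqs):
        covered = sum(s[i] != "-" for i in core)
        if covered / len(core) < min_coverage:
            flagged.append(name)
    return flagged


# Toy alignment: memberD matches only a short local stretch.
toy = {
    "memberA": "MKTLLVAGGSTW",
    "memberB": "MKTLIVAGG-TW",
    "memberC": "MRTLLVSGGSTW",
    "memberD": "------AGGS--",
}
print(low_coverage_members(toy))  # ['memberD']
```

A flag of this kind would only mark candidates for manual review; it does not replace looking at the alignment.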
Clusters of interest were selected based on the TAIR descriptions of the cluster members, on published data, and on topology data predicted by the Aramemnon database. The aim was to select: (1) membrane proteins that are not targeted to the chloroplast or the mitochondrion, and (2) proteins involved in signaling and protein turnover that are likely to interact with membrane proteins. Transcription factors (families retrieved from the TAIR website), pentatricopeptide domain-containing proteins, chloroplast and mitochondrial proteins, as well as soluble proteins not involved in signal transduction or ubiquitination were excluded. 544 genes that do not meet the above criteria but that had been obtained as donations were also accepted. Finally, 209 proteins found by proteomic or other approaches to be membrane-localized were added. A final list of 8415 candidate genes was thus obtained. According to the Aramemnon database, 5769 of these proteins are predicted to be membrane-localized (either integral or associated with the membrane) and 2646 are predicted to be soluble.
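The selection rules above can be read as a filter followed by two additions. The sketch below is one possible reading of those rules, not the project's actual pipeline: the record fields, function names, and gene identifiers are hypothetical, and the real selection was made per cluster using TAIR descriptions, published data, and Aramemnon predictions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    agi: str                      # placeholder identifier
    is_membrane: bool             # predicted membrane protein
    organellar: bool              # targeted to chloroplast or mitochondrion
    signaling_or_turnover: bool   # signaling or protein turnover
    transcription_factor: bool = False
    ppr: bool = False             # pentatricopeptide domain-containing


def meets_criteria(p):
    """Apply the exclusions first, then the two inclusion aims."""
    if p.organellar or p.transcription_factor or p.ppr:
        return False
    return p.is_membrane or p.signaling_or_turnover


def build_candidates(proteins, donated=(), experimentally_membrane=()):
    """Selected proteins plus donated clones and experimentally found ones."""
    selected = {p.agi for p in proteins if meets_criteria(p)}
    return selected | set(donated) | set(experimentally_membrane)


proteins = [
    Protein("geneA", True, False, False),    # membrane, kept
    Protein("geneB", True, True, False),     # organellar, excluded
    Protein("geneC", False, False, True),    # soluble signaling, kept
    Protein("geneD", False, False, False),   # soluble, excluded by criteria
    Protein("geneE", False, False, True, transcription_factor=True),
]
candidates = build_candidates(
    proteins, donated=["geneD"], experimentally_membrane=["geneF"]
)
print(sorted(candidates))  # ['geneA', 'geneC', 'geneD', 'geneF']

# The reported membrane and soluble predictions add up to the final list.
assert 5769 + 2646 == 8415
```

Using a set union means a donated or experimentally identified gene that already meets the criteria is counted only once.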
This list is considerably larger than the original list of 6400 genes. Genes with low transcript abundance will be difficult to clone. Rather than attempting to obtain the complete set of 8415 genes, we will therefore use this list as a basis for cloning as many genes as possible within the scope of the project.
AGI Accession in green: we already have the clone, and many of these clones are already available from ABRC.
AGI Accession in red: we have these clones, but they were provided by third parties and cannot be made available.
AGI Accession in grey: cloning of these genes is under way.
