%\VignetteIndexEntry{Using the PAnnBuilder Package}
%\VignetteKeywords{Annotation}
%\VignetteDepends{PAnnBuilder}
%\VignettePackage{PAnnBuilder}

\documentclass[12pt,fullpage]{article}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\usepackage{amsmath,pstricks,fullpage}
\usepackage{hyperref}
\usepackage{url}
\usepackage[authoryear,round]{natbib}
\usepackage{multirow}

\author{Hong Li$^\ddagger$\footnote{sysptm@gmail.com}}
\begin{document}
\title{Using the PAnnBuilder Package}
\maketitle
\begin{center}$^\ddagger$Key Lab of Systems Biology\\ 
Shanghai Institutes for Biological Sciences\\ 
Chinese Academy of Sciences, P. R. China
\end{center}

\tableofcontents

\section{Overview} 

In genomic era, genome-scale experiments and data analyses require genes to be 
annotated from different sources, an R \cite{1} package AnnBuilder was developed
for this purpose \cite{2}. In post genomic era, advances in proteomics highlight 
the urgency of understanding protein language \cite{3}. However, relative to genes, 
AnnBuilder is limited in protein annotation due to the complexity of proteins 
caused by 3-D strucutre, alternative splicing, modification, dynamic location 
and other features. The package PAnnBuilder focuses on assembling proteomic 
annotation data, which should be very useful for proteomic data interpretation. 
It not only inherits the good features of AnnBuilder such as automatically parsing 
protein annotation data and building R packages from selected sources, but also 
emphasizes specific features of proteins. PAnnBuilder currently supports 16 
databases involving diverse aspects of proteomics, such as protein primary data, 
protein domain/family, subcellular location, protein interaction, post-translational 
modifications, body fluid proteomics, homolog/ortholog groups and so on. 
Additionally, PAnnBuilder allows annotation to unknown proteins based on sequence 
similarities to other well-annotated proteins. To extend the use of PAnnBuilder, 
54 standard R annotation packages are produced from main protein databases, 
which are freely available for download via biocLite.

\section{Getting Started}

\subsection{Requirements}

PAnnBuilder requires the support from the following items:
\begin{enumerate}
\item For PAnnBuilder >= 1.5.0, R >=2.7.0 is needed for building SQLite-based 
package. Dependent R packages are needed to be installed: \Rpackage{methods}, 
\Rpackage{utils}, \Rpackage{RSQLite}, 
\Rpackage{Biobase} (>= 1.17.0), \Rpackage{AnnotationDbi (>= 1.3.12)}. 
If you use the installation script "biocLite.R" to install PAnnBuilder (>=1.5.0), 
it will automatically check these dependent packages and install missing 
packages from CRAN or Bioconductor.
\item Rtools is needed for Windows user. The matched version of Rtools with R can 
be downloaded via {\url{http://www.murdoch-sutherland.com/Rtools/}}.
\item Perl is required to parse the rather large annotation source data files. 
It can be downloaded from {\url{http://www.perl.com/download.csp}}.
\item Program formatdb and BLASTP is needed for function \Rfunction{crossBuilder}, 
\Rfunction{crossBuilder\_DB}, \Rfunction{pSeqBuilder}, \Rfunction{pSeqBuilder\_DB}.

BLAST can be downloaded from {\url{http://www.ncbi.nlm.nih.gov/BLAST/download.shtml}}.
\end{enumerate}
Note: It is better to have enough space for the temporary directory. The path of
the per-session temporary directory can be acquired by:
<<tempdir>>=
tempdir() 
@

\subsection{Installation}
The biocLite script is used to install PAnnBuilder from within R:
<<install, eval=FALSE>>=
source("http://bioconductor.org/biocLite.R")
biocLite("PAnnBuilder")
@
<<load>>=
library(PAnnBuilder)
@
Note: Web Connection is needed to install PAnnBuilder and its depended packages.

\subsection{Public Data Sources}

PAnnBuilder parses proteomics annotation data from public sources and build R 
annotation packages. It also provides convenient functions to access these sources. 
For example, you can get all supported databases for "Homo sapiens" by:
<<urls, eval=FALSE>>=
library(PAnnBuilder)         ## Load package 
getALLUrl("Homo sapiens")    ## Get urls 
getALLBuilt("Homo sapiens")  ## Get version/release
@
Detail description of all supported public data sources in PAnnBuilder are as 
follows:
\begin{description}
\item[UniProt] The data
    {\url{ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase}} will be
    used to map protein UniProt identifiers to diverse annotation available in 
    UniProt database.
\item[IPI] The data
    {\url{ftp://ftp.ebi.ac.uk/pub/databases/IPI/current}} will be
    used to map protein IPI identifiers to diverse annotation available in 
    International Protein Index database.
\item[RefSeq] The data
    {\url{ftp://ftp.ncbi.nih.gov/refseq}} will be
    used to map protein RefSeq identifiers to diverse annotation available in 
    NCBI RefSeq database.
\item[Entrez Gene] The data
    {\url{ftp://ftp.ncbi.nih.gov/gene/DATA}} will be
    used to annotate genes after the Entrez Gene identifiers have been obtained.
\item[Gene Ontology] The data
    {\url{ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes}} and 
    {\url{http://archive.geneontology.org/latest-termdb}}
    will be used to obtain gene ontology information. 
\item[KEGG] Some data at
    {\url{ftp://ftp.genome.jp/pub/kegg}} will be used to obtain pathway 
    information.
\item[HomoloGene] A data file provided by
  {\url{ftp://ftp.ncbi.nlm.nih.gov/pub/HomoloGene/current/}} will be used to
  extract mappings between GeneID/ProteinGI and HomoloGene ids.
\item[InParanoid] The data 
  {\url{http://inparanoid.sbc.su.se}} will be used to obtain ortholog protein 
  groups between two organisms.
\item[Gene Interaction] The data 
  {\url{ftp://ftp.ncbi.nih.gov/gene/GeneRIF}} will be used to extract 
  protein-protein interactions between Entrez GeneID or Protein RefSeq ids.
\item[IntAct] The data 
  {\url{ftp://ftp.ebi.ac.uk/pub/databases/intact/current/psimitab}} will be used
  to extract protein-protein interactions between UniProt protein accession numbers.
\item[MPPI] The data 
  {\url{http://mips.gsf.de/proj/ppi/data/mppi.gz}} will be used to extract 
  protein-protein interactions between UniProt protein accession numbers.
\item[3DID] The data 
  {\url{http://gatealoy.pcb.ub.es/3did/download}} will be used to extract 
  domain-domain interactions between Pfam domain identifiers.
\item[DOMINE] The data 
  {\url{http://domine.utdallas.edu}} will be used to extract 
  domain-domain interactions between Pfam domain identifiers.
\item[DBSubLoc] The data 
 {\url{http://www.bioinfo.tsinghua.edu.cn/~guotao}} will be used to obtain
 subcellular localization for protein from SWISS-PROT and PIR database.
\item[BaCelLo] The data 
 {\url{http://gpcr2.biocomp.unibo.it/bacello}} will be used to map SwissProt
 eukaryotes protein identifiers to subcellular localization.
\item[SCOP] The data 
 {\url{http://scop.mrc-lmb.cam.ac.uk/scop}} will be used to map PDB structure 
 identifiers to SCOP domain identifiers. 
\item[PeptideAtlas] The data 
 {\url{http://www.peptideatlas.org/builds}} will be used to obtain experimentally 
 identified peptides and their coordinates on chromosomes.
\item[SysPTM] The data 
 {\url{http://www.biosino.org/papers/SysPTM}} will be used to obtain protein
 post-translational modifications information.
\item[Sys-BodyFluid] The data 
 {\url{http://www.biosino.org/papers/Sys-BodyFluid}} will be used to map IPI
 protein identifiers to body fluids.  
\end{description}

\subsection{Annotation Packages Produced by PAnnBuilder}

\subsubsection{Packages produced by PAnnBuilder}

PAnnBuilder has powerful ability to build R package for assembling proteome 
annotation. However, the process of building new package may be time-consuming 
because of the downloading and parsing of large data files. To make PAnnBuilder
useful for any users, we have built many frequently used annotation packages in
advance. These pre-built packages can be downloaded via biocLite.

PAnnBuilder produces SQLite-based R packages using the "*Builder\_DB" 
functions (see Table~\ref{table:packages1}). Each SQLite-based annotation
package contains a number of AnnDbBimap objects. 

The pre-built packages provide a quick start for R beginners. If one wants to
analyze  protein set in Human IPI database, the quickest way is to download and
use the SQLite-based package "org.Hs.ipi.db". However, if the package one wants
has not been built or a new-version database is  released, new package should be
built using functions in PAnnBuilder (See  Section 4 for methods of building
annotation packages).
\
\begin{table}
\footnotesize
\caption{\label{table:packages1}SQLite-based Annotation Packges Produced by PAnnBuilder.}
\begin{center}
\begin{tabular}{|p{0.23\textwidth}|p{0.18\textwidth}|p{0.12\textwidth}|p{0.17\textwidth}|p{0.15\textwidth}|p{0.15\textwidth}|}
\hline
\bf{Description} & \bf{R Function} & \bf{Source} & \bf{Organism} 
 & \bf{Package} \\ \hline
\multirow{9}{0.25\textwidth}{complete and canonical annotaion for all proteins of
 a specific organism, including protein description, Entrez gene 
 identifier, KEGG pathway, gene ontology, domain, and so on.} & 
\multirow{9}{0.2\textwidth}{pBaseBuilder\_DB} &
 \multirow{3}{0.15\textwidth}{IPI} & 
  Homo sapiens & org.Hs.ipi.db \\
  & & & Mus musculus & org.Mm.ipi.db \\ 
  & & & Rattus norvegicus & org.Rn.ipi.db \\ 
 & & \multirow{3}{0.15\textwidth}{Swiss-Prot} & 
  Homo sapiens & org.Hs.sp.db \\
  & & & Mus musculus & org.Mm.sp.db \\
  & & & Rattus norvegicus & org.Rn.sp.db \\ 
 & & \multirow{3}{0.15\textwidth}{RefSeq} & 
  Homo sapiens & org.Hs.ref.db \\
  & & & Mus musculus & org.Mm.ref.db \\
  & & & Rattus norvegicus & org.Rn.ref.db \\ \hline 
\multirow{3}{0.25\textwidth}{protein indentifier mapping} & 
\multirow{3}{0.2\textwidth}{crossBuilder\_DB} & 
\multirow{3}{0.15\textwidth}{Swiss-Prot, IPI, RefSeq} & 
  Homo sapiens & org.Hs.cross.db \\
  & & &  Mus musculus & org.Mm.cross.db \\
  & & & Rattus norvegicus & org.Rn.cross.db \\ \hline 
\multirow{5}{0.25\textwidth}{protein-protein or domain-domain interaction data} & 
\multirow{5}{0.2\textwidth}{intBuilder\_DB}  
 & IntAct & & int.intact.db \\
 & & NCBI Gene & & int.geneint.db \\
 & & MPPI & & int.mppi.db \\
 & & 3did & & int.did.db \\
 & & Domine & & int.domine.db \\ \hline
\multirow{2}{0.25\textwidth}{protein subcell location} &
\multirow{2}{0.2\textwidth}{subcellBuilder\_DB} 
 & BaCelLo & & sc.bacello.db \\
 & & DBSubLoc & & sc.dbsubloc.db \\ \hline
protein structure classification & scopBuilder\_DB & SCOP & 
 & scop.db \\ \hline
protein post-translational modification & ptmBuilder\_DB & SysPTM &
 & sysptm.db \\ \hline
body fluid protein & bfBuilder\_DB & Sys-BodyFluid & Homo sapiens
 & org.Hs.bf.db \\ \hline
homolog protein group & HomoloGeneBuilder\_DB & HomoloGene
 & & homolog.db \\ \hline
ortholog protein group & InParanoidBuilder\_DB & InParanoid
 & Homo sapiens, Mus musculus & org.HsMm.ortholog.db \\ \hline
peptides identified by mass spectrometry & PeptideAtlasBuilder\_DB
 & PeptideAtlas & Homo sapiens & org.Hs.pep.db \\ \hline
gene ontology & GOABuilder\_DB & GOA & Homo sapiens & org.Hs.goa.db \\ \hline
identifier and name & dNameBuilder\_DB & & & dName.db \\ \hline
\end{tabular}
\end{center}
\end{table}

\subsubsection{Using annotation data package}
SQLite-based *.db packages are capable of flexible data queries, reverse mapping,
and  data filtering. Vignette of "AnnotationDbi" detailedly described  how to use
SQLite based packages
({\url{http://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/AnnotationDbi.pdf}}). 
Here package "org.Hs.ipi.db" produced by PAnnBuilder is illustrated as an example
in the following code chuck.
\begin{enumerate}
\item Install and load *.db annotation package.
<<exampleInstall1, eval=FALSE>>=
biocLite("org.Hs.ipi.db")
@
<<exampleLoad1>>=
library(org.Hs.ipi.db)
@
\item Browse data in the package.
<<help>>=
?org.Hs.ipiGENEID
@
\item Convert object into a "list" object, and get values by index or name.
<<data1>>=
xx <- as.list(org.Hs.ipiGENEID)
xx[!is.na(xx)][1:3]
xx[["IPI00792103.1"]]
@
\item Specific utilities for SQLite-based *.db packages.
<<data2>>=
## Access the data in table (data.frame) format via function "toTable".
toTable(org.Hs.ipiPATH[1:3])
## reverse the role of protein and pathway via 
## function "revmap" or "reverseSplit".
tmp1 <- revmap(org.Hs.ipiPATH)  ## return a AnnDbBimap Object
class(tmp1)
as.list(tmp1)[1]
tmp2 <- reverseSplit(as.list(org.Hs.ipiPATH)) ## return a list Object
class(tmp2)
tmp2[1]

## The left and right keys of the Bimap can be extracted 
## using "Lkeys" and "Rkeys".
Lkeys(org.Hs.ipiPATH)[1:3]
Rkeys(org.Hs.ipiPATH)[1:3]

## Get the create table statements.
org.Hs.ipi_dbschema()

## Use "SELECT" SQL query.
selectSQL<-paste("SELECT ipi_id, de",
           "FROM basic",
           "WHERE de like '%histone%'")
tmp3 <- dbGetQuery(org.Hs.ipi_dbconn(), selectSQL)
tmp3[1:3,]
@
\end{enumerate}

\section{Function Description}

\subsection{Getting URL and Version}

To download data file from public database, the first step is getting its URL 
and release/version information. URLs of supported databases are stored in 
"data/sourceURLs.txt". Following functions are used to get the url:
\begin{itemize}
\item \Rfunction{getSrcUrl} - return url according given database.
\item \Rfunction{getALLUrl} - return urls for all databases used in PAnnBuilder 
 packages.
\item \Rfunction{getSrcBuilt} - return release/version according given database.
\item \Rfunction{getALLBuilt} - return release/version information for all 
 databases used in PAnnBuilder packages.
\end{itemize}

\subsection{Parsing and Writing Data}

Parsing is a key step to convert original data file to R object. Sometimes R is 
directly used to parse and write data. But for large data file or complicated 
data format, perl is firstly employed to quickly process data, and then R 
function reads the result file into R objects. 

\subsubsection{Employing perl program to parse data}

Segment of perl program is written into file in "inst/scripts". Name and 
function of these parser files are as follows:
\begin{itemize}
\item spParser - parse protein data from SwissProt or TrEMBL
\item ipiParser - parse protein data from IPI
\item refseqParser - parse protein data from NCBI RefSeq
\item equalParser - find protein ID mapping with equal sequences
\item mergeParser - merge different ID mapping files
\item mppiParser - parse protein protein interaction data from MIPS
\item paParser - parse data from PeptideAtlas
\item dbsublocParser - parse data from DBSubLoc
\item pfamNameParser - parse domain id and name from Pfam
\item blastParser - filter the results of blast
\end{itemize}
Function \Rfunction{fileMuncher} and \Rfunction{fileMuncher\_DB}  perl file 
based on given parser file and additional input data file, then perform this 
perl program via R.

\subsubsection{Writing data using R}

Besides using perl program, R functions also parse data from simple data file and 
store them as R environment objects or Bimap objects.
\begin{itemize}
\item \Rfunction{createEmptyDPkg} - create an empty R package at given directory.
\item \Rfunction{writeSQ} - write sequence data into R package.
\item \Rfunction{writeName} - parse multiple data file, and write the mapping of 
 id and name into R package. It employs \Rfunction{writeGOName}, 
 \Rfunction{writeKEGGName}, \Rfunction{writePFAMName},  
 \Rfunction{writeINTERPROName} and \Rfunction{writeTAXName} to 
 respectively write data from GO, KEGG, Pfam, InterPro and TAX.
\item \Rfunction{writeSCOPData} - parse structural classification of proteins from 
  SCOP database.
\item \Rfunction{writeSubCellData} - parse data from protein subcellular 
  location databases, and write into R package. It employs 
  \Rfunction{writeBACELLOData} and \Rfunction{writeDBSUBLOCData} to 
  respectively write data from BaCelLo and DBSubLoc.
\item \Rfunction{writeIntData} - parse data from protein-protein/domain-domain
  interaction databases, and write into R package. It employs 
  \Rfunction{writeGENEINTData}, \Rfunction{writeINTACTData}, 
  \Rfunction{writeMPPIData}, \Rfunction{write3DIDData} and 
  \Rfunction{writeDOMINEData} to respectively write data from NCBI gene 
  interaction data file, EBI intact, MIPS interaction data, 3DID database 
  and DOMINE database.
\item \Rfunction{writePtmData} - parse database involving protein 
  post-translational modifications, and write into R package. It employs 
  \Rfunction{writeSYSPTMData} to write data from SysPTM database.
\item \Rfunction{writeBfData} - write data involving body fluids proteomics 
  into R package. It employs \Rfunction{writeSYSBODYFLUIDData} to write data
  from Sys-BodyFluid database.
\item \Rfunction{writeGOAData} - write gene ontology terms from GOA database 
  into R package.
\item \Rfunction{writeHomoloGeneData} - write homolog groups from NCBI 
  HomoloGene into R package.
\item \Rfunction{writeInParanoidData} - write paralog groups from InParanoid 
  into R package.
\item \Rfunction{writePeptideAtlasData} - write peptides identified by Mass 
  Spectrometry from PeptideAtlas database into R package.
\item \Rfunction{writeMeta\_DB} - write meta information about the annotation 
  package into SQLite-based package.
\item \Rfunction{writeData\_DB} - parse data from databases, and write data as 
  tables into SQLite-based R package. It employs \Rfunction{writeSPData\_DB}, 
  \Rfunction{writeIPIData\_DB}, \Rfunction{writeREFSEQData\_DB}, 
  \Rfunction{writeGENEINTData\_DB}, \Rfunction{writeINTACTData\_DB}, 
  \Rfunction{writeMPPIData\_DB}, \Rfunction{write3DIDData\_DB},
  \Rfunction{writeDOMINEData\_DB}, \Rfunction{writeSYSBODYFLUIDData\_DB}, 
  \Rfunction{writeSYSPTMData\_DB}, \Rfunction{writeSCOPData\_DB}, 
  \Rfunction{writeBACELLOData\_DB}, \Rfunction{writeDBSUBLOCData\_DB},
  \Rfunction{writeGOAData\_DB}, \Rfunction{writeHomoloGeneData\_DB},
  \Rfunction{writeInParanoidData\_DB}, and \Rfunction{writePeptideAtlasData\_DB}
  to respectively write data from Swiss-Prot, IPI, NCBI RefSeq database, and so on.
\item \Rfunction{writeName\_DB} - parse multiple data file, and write the mapping
  of id and name into SQLite-based R package. It employs \Rfunction{writeGOName\_DB}, 
  \Rfunction{writeKEGGName\_DB}, \Rfunction{writePFAMName\_DB},  
  \Rfunction{writeINTERPROName\_DB} and \Rfunction{writeTAXName\_DB} to 
  respectively write data from GO, KEGG, Pfam, InterPro and TAX.
\item \Rfunction{createSeeds} - define a list of AnnDbBimap objects which indicates
  key and value of .
\item \Rfunction{createAnnObjs} - produce AnnDbBimap objects based on the 
  definition in \Rfunction{createSeeds}.
\end{itemize}

\subsection{Writing Help Documents}

Help documents is an important part for new package. Diverse templates of help 
documents are stored in the "inst/templates" directory. When building new package, 
R functions use these templates to create "*.rd" help file in the "man" directory:
\begin{itemize}
\item \Rfunction{getRepList} - return a list which will replace the symbols in 
 template file.
\item \Rfunction{copyTemplates\_DB} - will produce an "*.rd" file in the "man" 
 subdirectory for SQLite-based annotation package. Related template files in the
 "inst/templates" subdirectory 
 are needed for this function.  . 
\item \Rfunction{writeDescription\_DB} - write description file for SQLite-based 
 annotation package. 
\end{itemize}

\subsection{Building Data Packages}

Basic functions described above make it possible to build proteomic annotation 
data packages. Based on these, PAnnBuilder develops multiple sophisticated functions 
to assemble proteomic annotation data. Each function is implemented by the
"*Builder\_DB" R functions.
\begin{itemize}
\item \Rfunction{pBaseBuilder\_DB} - build annotation data packages for primary
 protein database such as SwissProt, TREMBL, IPI or NCBI RefSeq protein data. 
\item \Rfunction{pSeqBuilder\_DB} - build annotation data packages for query
 protein sequences based on sequence similarity.
\item \Rfunction{crossBuilder\_DB} - build annotation data packages for protein
 id mapping in SwissProt, Trembl, IPI and NCBI Refseq databases. 
\item \Rfunction{subcellBuilder\_DB} - build annotation data packages for
 protein subcellular location from BaCelLo or DBSubLoc database.
\item \Rfunction{HomoloGeneBuilder\_DB} - build annotation data packages for
 homolog protein group from NCBI HomoloGene database.
\item \Rfunction{InParanoidBuilder\_DB} - build annotation data packages for
 ortholog protein group between two given organisms from InParanoid database.
\item \Rfunction{GOABuilder\_DB} - build annotation data packages for mapping
 proteins of UniProt to Gene Ontolgy from GOA database.
\item \Rfunction{scopBuilder\_DB} - build annotation data packages for
 Structural Classification of Proteins.
\item \Rfunction{intBuilder\_DB} - build annotation data packages for
 protein-protein or domain-domain interaction from IntAct, MPPI, 3DID, DOMINE
 or NCBI Gene interaction database.
\item \Rfunction{PeptideAtlasBuilder\_DB} - build annotation data packages for
 experimentally identified peptides from PeptideAtlas database.
\item \Rfunction{ptmBuilder\_DB} - build annotation data packages for
 post-translational modifications from SysPTM database.
\item \Rfunction{bfBuilder\_DB} - build annotation data packages for proteins
 in body fluids from SysBodyFluid database.
\item \Rfunction{dNameBuilder\_DB} - build annotation data packages for mapping
 between entry ID and name from GO, KEGG, Pfam, InterPro and NCBI Taxonomy
 databases.
\end{itemize}

\section{Building Annotation Data Packages}

\begin{enumerate}
\item The first thing you need to do is setting basic parameters such as "pkgpath", 
"version", and "author".
<<parameter>>=
# Set path, version and author for the package.
library(PAnnBuilder)
pkgPath <- tempdir()
version <- "1.0.0"
author <- list()
author[["authors"]] <- "Hong Li"
author[["maintainer"]] <- "Hong Li <sysptm@gmail.com>"
@
\item Then you can run diverse "*Builder\_DB" functions to build packages
 by yourselves.  \Rfunction{pBaseBuilder\_DB}, 
\Rfunction{subcellBuilder\_DB}, and
\Rfunction{pSeqBuilder\_DB} are taken as examples to build annotation packages.
\begin{description}
\item[pBaseBuilder\_DB] builds annotation data packages for proteins in three primary
 protein databases (SwissProt, IPI, RefSeq). It is a convenient way to obtain 
 complete and canonical annotation, including protein description, Entrez gene 
 identifier, KEGG pathway, gene ontology, domain, coordinates on chromosomes 
 and so on. For example, if you want to build annotation package for Mouse IPI 
 database, you can use codes as follows:
<<pBaseBuilder, eval=FALSE>>=
 ## Build SQLite based annotation package "org.Mm.ipi.db" 
 ## for Mouse IPI database.
 ## Note: Perl is needed for parsing data file. 
 ##       Rtools is needed for Windows user.
 pBaseBuilder_DB(baseMapType = "ipi", organism = "Mus musculus",  
                 prefix = "org.Mm.ipi", pkgPath = pkgPath, version = version, 
                 author = author)              
@
After running, a subdirectory called "org.Mm.ipi" will be produced in the path 
given by "pkgPath". This directory contains all data and files, which can be 
used to build R package by "R CMD build" command. 

\item[subcellBuilder\_DB] builds annotation data Package which
 provides protein subcellular location information.
<<subcellBaseBuilder>>=
 ## Build subcellular location annotation package "sc.bacello.db" 
 ## from BaCelLo database.
 subcellBuilder_DB(src="BaCelLo", prefix="sc.bacello", 
                pkgPath, version, author)
  ## List all files in created directory "sc.bacello.db".
  dir(file.path(pkgPath,"sc.bacello.db"))
@

\item[pSeqBuilder\_DB] uses blast to calculate sequence similarity between query 
proteins and subject proteins, then assign annotation for query protein
according to existing annotation of its similar proteins. pSeqBuilder\_DB is
useful for  proteins which have not well annotated. Following code chunk gives
an example for annotation query proteins by pSeqBuilder\_DB. Needed R packages
\Rpackage{org.Hs.sp.db},  \Rpackage{org.Hs.ipi.db}, can be downloaded via biocLite.
<<pSeqBuilder, eval=FALSE>>=
## Read query sequence.
tmp = system.file("data", "query.example", package="PAnnBuilder")
tmp = readLines(tmp)
tag = grep("^>",tmp)
query <- sapply(1:(length(tag)-1), function(x){ 
     paste(tmp[(tag[x]+1):(tag[x+1]-1)], collapse="") })
query <- c(query, paste(tmp[(tag[length(tag)]+1):length(tmp)], collapse="") )
names(query) = sub(">","",tmp[tag])
## Set parameters for sequence similarity.
blast <- c("blastp", "10.0", "BLOSUM62", "0", "-1", "-1", "T", "F")
names(blast) <- c("p","e","M","W","G","E","U","F")
match <- c(0.00001, 0.9, 0.9)
names(match) <- c("e","c","i")

## Install packages "org.Hs.sp.db", "org.Hs.ipi.db".
if( !require("org.Hs.sp.db") ){
  biocLite("org.Hs.sp.db")
}
if( !require("org.Hs.ipi.db") ){
  biocLite("org.Hs.ipi.db")
}

## Use packages "org.Hs.sp.db", "org.Hs.ipi.db" to produce annotation  
## R package for query sequence.
annPkgs <- c("org.Hs.sp.db","org.Hs.ipi.db")  
seqName <- c("org.Hs.spSEQ","org.Hs.ipiSEQ")  
pSeqBuilder_DB(query, annPkgs, seqName, blast, match, prefix="test1",
               pkgPath, version, author)  
@
\end{description}

\item After the running of "*Builder" or "*Builder\_DB" function has been 
finished, a subdirectory named "pkgName" will be produced in given "pkgPath". 
Then the command "R CMD build" can be used to build source R package, and 
"R CMD --binary build" can be used to build binary R package for Windows. 

\item Note: 
\begin{itemize}
\item Web connection is needed to download files from public databases, and Perl 
is needed to parse data files. Additionally, Rtools is needed for Windows user.
\item Users should be aware that downloading, parsing, and saving data may take
a long time, in addition to requiring enough disk space to store temporary 
data files. 
\item "R CMD build" and "R CMD --binary build" should be used in command line, 
not in R. Detailed document about how to create your own packages can be found 
in the book "Writing R Extensions" ({\url{http://cran.r-project.org/doc/manuals/R-exts.pdf}}). 
For Windows users, "R CMD build" needs you to have installed the files for building 
source packages (which is the default), as well as the Windows toolset (see the 
"R Installation and Administration" manual at {\url{http://cran.r-project.org/doc/manuals/R-admin.pdf}}).
\end{itemize}

\end{enumerate}

\section{Session Information}

This vignette was generated using the following package versions:
<<echo=FALSE>>=
sessionInfo()
@

\bibliographystyle{plainnat}
\bibliography{PAnnBuilder}

\end{document}