Having this numerical measure of function specificity enables our statistical model to make predictions about function specificity and function similarity between GO terms. The model described here serves as a novel tool for protein annotation: it predicts the specificity of function, based on the GO hierarchy, that may be shared between two proteins at a given level of sequence similarity. Through the statistical modeling process we shed light on the variability in the relationship between sequence similarity and function similarity. In addition, we demonstrate the usefulness of our model through two use cases: evaluating existing functional annotations in protein databases that were assigned by predictive methods, and providing possible annotations for thousands of hypothetical proteins. The models used do not have to be overly complex; GLMs and GAMs are relatively simple to understand and deploy. What matters is that they are developed appropriately. In this study we use GLMs and GAMs to model the relationship between the sequence similarity of two proteins and their function similarity. This represents a novel approach to functional annotation, and one that is potentially more accurate than current methods based on sequence similarity thresholds, which do not account for the degree of function specificity that can be transferred between proteins over a wide range of sequence similarity. Our annotation model accounts for the fact that the function similarity between two proteins generally increases as their sequence similarity increases over a broad range of BLAST bit scores. Using our annotation model we demonstrated that statistical models trained with experimental data generally predict lower functional similarity, over a range of BLAST bit scores, than those trained with electronic data.
This suggests that the sequence similarity threshold applied in many electronic annotations may be below the degree of sequence similarity required to transfer exact and specific functions from experimentally characterized proteins, at least for moderate bit score ranges.