The dataset can be downloaded here.
Each file represents a DNA-binding protein sequence. Each residue in the sequence is labeled as either interface (+1, a binding residue) or non-interface (-1, a non-binding residue). The files are in the .arff format supported by WEKA, an open-source collection of machine learning algorithms for data mining tasks.
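As a rough illustration of how such a file might be read outside of WEKA, the following Python sketch parses a dense ARFF file and counts interface and non-interface residues. It rests on assumptions not stated above: that the class label is the last attribute of each data row, that labels are written as +1 and -1, and that attribute names contain no spaces. The attribute names and the inline example data are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_arff(lines: Iterable[str]) -> Tuple[List[str], List[List[str]]]:
    """Parse a simple dense ARFF file into (attribute names, data rows).

    Minimal sketch: ignores comments and blank lines, and does not handle
    quoted attribute names, quoted values, or sparse ARFF rows.
    """
    attributes: List[str] = []
    rows: List[List[str]] = []
    in_data = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name
            attributes.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows


def count_labels(rows: List[List[str]], label_index: int = -1) -> Counter:
    """Count residues per class label (assumed to be the last column)."""
    return Counter(int(row[label_index]) for row in rows)


# Hypothetical example content; real files will have different attributes.
EXAMPLE = """\
@relation hypothetical_protein
@attribute residue {A,C,D,G}
@attribute class {+1,-1}
@data
A,+1
C,-1
G,-1
D,+1
G,-1
"""

attributes, rows = parse_arff(EXAMPLE.splitlines())
counts = count_labels(rows)
print("attributes:", attributes)
print("interface residues:", counts[1])
print("non-interface residues:", counts[-1])
```

To apply the same functions to a downloaded file, pass an open file handle to parse_arff in place of EXAMPLE.splitlines(), and adjust label_index if the class attribute is not the last column.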
For more information about dataset construction, please see our papers:
Lee, J-H., Hamilton, M., Gleeson, C., Caragea, C., Zaback, P., Sander, J., Lee, X., Wu, F., Terribilini, M., Honavar, V., and Dobbs, D. Striking Similarities in Diverse Telomerase Proteins Revealed by Combining Structure Prediction and Machine Learning Approaches. In Proceedings of the Pacific Symposium on Biocomputing (PSB 2008). In press.
Yan, C., Terribilini, M., Wu, F., Jernigan, R., Dobbs, D., and Honavar, V.