参考文献
[1] 李国杰,程学旗.大数据研究:未来科技及经济社会发展的重大战略领域.中国科学院院刊,2012,27(6):647-657.
[2] Howe D, Costanzo M, Fey P, et al. Big data: the future of biocuration. Nature, 2008, 455 (7209): 47-50.
[3] Staff S, Dealing with data: challenges and opportunities. Science, 2011, 331 (6018): 692-693.
[4] Holland JH. Emergence: from chaos to order. Boston, MA: Addison-Wesley, 1997.
[5] Hey T, Tansley S. The fourth paradigm: data-intensive scientific discovery. Microsoft Research, 2009.
[6] Phan X-H, Nguyen L-M, Horiguchi S. Learning to classify short and sparse text & Web with hidden topics from large-scale data collections. In: Proc. of the 17th Int’l Conf. on World Wide Web. Beijing, 2008. 91-100.
[7] Sahami M, Heilman TD. A Web-based kernel function for measuring the similarity of short text snippets. In: Proc. of the 15th Int’l Conf. on World Wide Web. Edinburgh, 2006. 377-386.
[8] Efron M, Organisciak P, Fenlon K. Improving retrieval of short texts through document expansion. In: Proc. of the 35th Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval. Portland, 2012. 911-920.
[9] Hong L, Ahmed A, Gurumurthy S, et al. Discovering geographical topics in the Twitter stream. In: Proc. of the 21st Int’l Conf. on World Wide Web (WWW 2012). Lyon, 2012. 769-778.
[10] Pozdnoukhov A, Kaiser C. Space-time dynamics of topics in streaming text. In: Proc. of the 3rd ACM SIGSPATIAL Int’l Workshop on Location-based Social Networks. Chicago, 2011. 1-8.
[11] Sun Y-Z, Norick B, Han J-W, et al. Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. In: Proc. of the 18th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining. Beijing, 2012. 1348-1356.
[12] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. New York: Springer-Verlag, 2009.
[13] Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 2009, 37 (1): 246-270.
[14] 周傲英,金澈清,王国仁,李建中.不确定性数据管理技术研究综述.计算机学报,2009,32(1):1-16.
[15] Abiteboul S, Kanellakis P C, Grahne G. On the representation and querying of sets of possible worlds. Theoretical Computer Science, 1991, 78 (1): 158-187.
[16] Koller D, Friedman N. Probabilistic graphical models: principles and techniques. Cambridge, MA: The MIT Press, 2009.(概率图模型. 王飞跃,韩素青,译. 北京: 清华大学出版社,2015.)
[17] Aggarwal CC. Managing and mining uncertain data. Berlin: Springer-Verlag, 2009.
[18] Wang Q, Xu J, Li H, Craswell N. Regularized latent semantic indexing. In: Proc. of the 34th Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR’11). Beijing, 2011. 685-694.
[19] Mackey L, Talwalkar A, Jordan MI. Divide-and-conquer matrix factorization. Advances in Neural Information Processing Systems (NIPS) , 2012, 24: 0669.
[20] Gershman S, Blei D. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 2012, 56: 1-12.
[21] Kulis B, Jordan MI. Revisitingk-means: new algorithms via Bayesian nonparametrics. In: Proc. of the 29th Int’l Conf. on Machine Learning (ICML). Edinburgh, 2012.
[22] Bar-Yam Y. A mathematical theory of strong emergence using multiscale variety. Complexity, 2004, 9 (6): 15-24.
[23] Bedau MA. Weak emergence. Noûs, 1997, 31 (s11): 375-399.
[24] Chalmers DJ. Strong and weak emergence. Oxford: Oxford University Press, 2002.
[25] Henrya AD, Prałat P, Zhang CQ. Emergence of segregation in evolving social networks. Proc. of the National Academy of Sciences, 2011, 108 (21): 8605-8610.
[26] Bergman MK. White Paper: The Deep Web: surfacing hidden value. Journal of Electronic Publishing, 2001, 7 (1) , DOI: 10. 3998/3336451. 0007. 104.
[27] Florescu D, Levy A, Mendelzon A. Database techniques for the World-Wide-Web: a survey. SIGMOD Record, 1998, 27 (3): 59-74.
[28] Fan WF. Data quality: theory and practice. In: Proc. of the 2012 Int’l Conf. on Web-Age Information Management (WAIM’12). Harbin, 2012. 1-16.
[29] Fan WF, Geerts F. Foundations of data quality management. Synthesis Lectures on Data Management, 2012, 4 (5): 1-217.
[30] Fan WF. Dependencies revisited for improving data quality. In: Proc. of the 27th ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems (PODS’08). Vancuver, 2008. 159-170.
[31] Dasgupta A, Das G, Mannila H. A random walk approach to sampling hidden databases. In: Proc. of the 2007 ACM SIGMOD Int’l Conf. on Management of Data. Beijing, 2007. 629-640.
[32] 刘伟,孟小峰,凌妍妍.一种基于图模型的Web数据库采样方法.软件学报,2008,19(2):179-193.
[33] Wang RY, Ben HB, Madnick S E. Data quality requirements analysis and modeling. In: Proc. of the 9th Int’l Conf. on Data Engineering. Vienna, 1993. 670-677.
[34] Galhardas H, Florescu D, Shasha D, Simon E. AJAX: an extensible data cleaning tool. ACM SIGMOD Record, 2000, 29 (2): 590.
[35] 郭志懋,周傲英.数据质量和数据清洗研究综述.软件学报,2002,13(11):2076-2082.
[36] Hernandez MA, Stolfo SJ. Real-world data is dirty: data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 1998, 2 (1): 9-37.
[37] Abiteboul S, Cluet S, Milo T, et al. Tools for data translation and integration. IEEE Data Engineering Bulletin, 1999, 22 (1): 3-8.
[38] Milo T, Zohar S. Using schema matching to simplify heterogeneous data translation. In: Proc. of the 24th Int’l Conf. on Very Large Data Bases. New York, 1998. 122-133.
[39] Deerwester S, Dumais ST, Furnas GW, et al. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41 (6): 391-407.
[40] Hofmann T. Probabilistic latent semantic indexing. In: Proc. of the 22nd Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval. Berkeley, 1999. 50-57.
[41] Guo JF, Xu F, Cheng XQ, et al. Named entity recognition in query. In: Proc. of the 32nd Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR’09). Boston, 2009. 267-274.
[42] Chang F, Dean J, Ghemawat S, et al. A distributed storage system for structures data. In: Proc. of the 7th Symp. on Operating Systems Design and Implementation. Seattle, 2006. 205-218.
[43] Abadi D, Madden S, Hachem N. Column-stores, row-stores. how different are they really? In: Proc. of the 2008 ACM SIGMOD Int’l Conf. on Management of Data. Vancouver, 2008. 967-980.
[44] He YQ, Lee R, Huai Y, et al. RCFile: a fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: Proc. of the IEEE 27th Int’l Conf. on Data Engineering (ICDE). Hannover, 2011. 1199-1208.
[45] Zou YQ, Liu J, Wang SC, et al. CCIndex: a complemental clustering index on distributed ordered tables for multi-dimensional range queries. In: Proc. of the Network and Parallel Computing. Zhengzhou, 2010. 247-261.
[46] Gao M, Jin CQ, Wwang XL, et al. A survey on management of data provenance. Chinese Journal of Computers, 2010, 33 (3): 373-389.
[47] 高明,金澈清,王晓玲,等.数据世系管理技术研究综述.计算机学报,2010,33(3):373-389.
[48] Buneman P, Khanna S, Tan WC. Data provenance: some basic issues. In: Proc. of Int’l Conf. on Foundation of Software Technology and Theoretical Computer Science. New Delhi, 2000. 87-93.
[49] Tan WC. Provenance in databases: past, current, and future. IEEE Data Engineering Bulletin, 2007, 30 (4): 3-12.
[50] 宫学庆,金澈清,王晓玲,等.数据密集型科学与工程:需求和挑战.计算机学报,2012,35(8):1563-1578.
[51] Li P, Burges C, Wu Q. MCRank: learning to rank using multiple classification and gradient boosting. Advances in Neural Information Processing Systems (NIPS’07) , 2007, 19: 845-852.
[52] Freund Y, Iver R, Schapire RE, Singer Y. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 2003, 4: 933-969.
[53] Burges C, Shaked T, Renshaw E. Learning to rank using gradient descent. In: Proc. of the Int’l Conf. on Machine Learning, ICML’05. Bonn, 2005. 89-96.
[54] Cao Z, Qin T, Liu T-Y. Learning to rank: from pairwise approach to listwise approach. In: Proc. ofthe Int’l Conf. on Machine Learning (ICML’07). Corvallis, 2007. 129-136.
[55] Xu J, Li H. AdaRank: a boosting algorithm for information retrieval. In: Proc. of the 31st Int’l ACM SIGIR Conf. (SIGIR’07). Amsterdam, 2007. 391-398.
[56] Yue Y, Finley T, Radlinski F. A support vector method for optimizing average precision. In: Proc. of the 31rd Int’l ACM SIGIR Conf. (SIGIR’07). Amsterdam, 2007. 271-278.
[57] 程学旗,郭嘉丰,靳小龙.网络信息的检索与挖掘回顾.中文信息学报,2011,25(6):111-117.
[58] Yangarber R, Grishman R. NYU: description of the Proteus/PET system as used for MUC-7. In: Proc. of the 7th Message Understanding Conf. (MUC’98). Fairfax, 1998.
[59] Zelenko D, Aone C, Richardella A. Kernel methods for relation extraction. Journal of Machine Learning Research, 2003, 3: 1083-1106.
[60] Getoor L, Taskar B. Introduction to statistical relational learning. In: Adaptive Computation and Machine Learning, Cambridge, MA: The MIT Press, 2007.
[61] 沈华伟,靳小龙,任福新,等.面向社会媒体的舆情分析.中国计算机学会通讯,2012,8(4):32-36.
[62] Girvan M, Newman MEJ. Community structure in social and biological networks. Proc. of the National Academy of Sciences, 2002, 99 (12): 7821-7826.
[63] Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Physical Review E, 2004, 69 (2): 026113.
[64] Palla G, Derényi I, Farkas I, Vicsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 2005, 435 (7043): 814-818.
[65] Shen HW, Cheng XQ, Cai K, Hu MB. Detect overlapping and hierarchical community structure in networks. Physica A, 2009, 388 (8): 1706-1712.
[66] Shen HW, Cheng XQ, Guo JF. Quantifying and identifying the overlapping community structure in networks. Journal of Statistical Mechanics: Theory and Experiment, 2009, (7): P07042.
[67] Barabasi AL, Albert R. Emergence of scaling in random networks. Science, 1999, 286 (5439): 509-512.
[68] Palla G, Barabási AL, Vicsek T. Quantifying the social group evolution. Nature, 2007, 446 (7136): 664-667.
[69] Wu WT, Li HS, Wang XS, Zhu K. Probase: a probabilistic taxonomy for text understanding. In: Proc. of the 2012 Int’l Conf. on Management of Data (SIGMOD). Scottsdale, 2012. 481-492.