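Course Overview
Search engines are essential tools for managing and mining large text data. In this course, you will learn how search engines work, the major search algorithms, and how to optimize search accuracy.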
Recent years have seen dramatic growth in natural language text data, including web pages, news articles, scientific literature, emails, enterprise documents, and social media such as blog articles, forum posts, product reviews, and tweets. Text data are unique in that they are usually generated directly by humans rather than by a computer system or sensors. They are therefore especially valuable for discovering knowledge about people’s opinions and preferences, in addition to the many other kinds of knowledge that we encode in text.
This course covers search engine technologies, which play an important role in any data mining application involving text data, for two reasons. First, while the raw data for a particular problem may be large, often only a relatively small subset of it is relevant, and a search engine is an essential tool for quickly finding that small subset of relevant text in a large text collection. Second, search engines help analysts interpret patterns discovered in the data by letting them examine the relevant original text and make sense of each pattern. You will learn the basic concepts, principles, and major techniques of text retrieval, which is the underlying science of search engines.
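Course Syllabus
Introduction to Text Data Mining
Basic Concepts in Text Retrieval
Information Retrieval Models
Implementation of a Search Engine
Evaluation of Search Engines
Advanced Search Engine Technologies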
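Prerequisites
You should have basic knowledge of data structures, ideally with proficiency in programming in C++ or Java. Familiarity with basic probability and statistics is helpful but not required.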