数据集; 数据元素; 非结构化数据; Simple ordered sequential file; Sequential file structure 顺序文件结构; 层级文件结构; Commercial document-retrieval; Family tree 系谱图; 网络文件结构; Database storage structure; Relational structure 关系结构; Relational algebra 关系代数; Logical unit 逻辑单元; Flat-panel display 平面显示器; 图书目录; 认知实体; 图像分辨率; 一次文献; 二次文献; 文献信息; Full-text retrieval system 全文检索系统; Storage media 存储介质
此资料由梦斯姐编著,刘晶、杜丹、小曼、洪平、定文、吴枫(排名不分先后)等友情参与!!
为了电商的荣誉,大家一定要好好考!90分!0不考!不在是梦想,相信自己!就像相信梦斯一样!!!
第六课
6. Indexing and Vocabulary Control
Indexing may be thought of as a process of labeling items for future reference, one that demands considerable care and skill. Considerable order can be introduced into this process by standardizing the terms that are to be used as labels. This standardization is known as vocabulary control, the systematic selection of preferred terms. In a sense, we all exercise vocabulary control whether we are conscious of it or not. The fact that this book is written in English rather than French, Russian, or Swahili is evidence of control, standardization, and the introduction of order into the communication process. Similarly, when people converse, they normally do so in a natural language that is mutually agreeable to them. However, for indexing purposes, the problem of vocabulary control is not solved by choosing one particular natural language and sticking with it. There are several linguistic problems that have to be taken into consideration. Probably the three most important considerations are (1) synonymy, (2) semantic ambiguity, and (3) the proper choice of generic levels of meaning.
索引可能被认为是一个为以后的参考来标记术语的过程,这个过程包含了许多的关注和技能。相当多的顺序可以通过标准化那些用于做标签的术语从而被引入这个过程。这个标准化作为词汇控制为人们所知道,也就是对感兴趣的术语的系统挑选。在一定意义上,无论我们是否意识到我们都经历了词汇控制。这本书用英语撰写而不是用法语、俄语或斯瓦西里语的事实正是在交流过程中控制、标准化和顺序介绍的证据。同样地,当人们交谈时,他们通常是用一种他们相互认可的自然语言进行的。然而,为了索引的目的,词汇控制的问题并没有通过选择一个特别的自然语言并坚持使用它来解决。有很多语言问题不得不考虑。或许三个最重要的需要考虑的方面是同义现象、语义含糊和意义类属层次的正确选择。
Synonyms are two or more words having the same meaning, and the use of synonyms in an index obviously scatters information throughout the alphabet. For example, employing two synonymous terms forces users of the index to look in both places to make sure that they have found everything that might be of interest.
同义词就是两个或两个以上具有相同意义的词,很明显地,在索引中同义词的使用会导致信息分散到整个字母表。例如,采用两个同义术语会迫使索引使用者去检查两个位置,以此来保证他们能获取感兴趣的所有信息。
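The effect of vocabulary control on synonyms can be sketched in a few lines of Python. This is only an illustration, not a method taken from the text: the preferred-term table, the sample labels, and the function names `normalize` and `build_index` are all made up for the example. Each synonym is mapped to one preferred term before the item is filed, so a searcher only has to look in one place.

```python
# Hypothetical controlled vocabulary: every variant points to one preferred term.
PREFERRED_TERM = {
    "car": "automobile",
    "motorcar": "automobile",
    "automobile": "automobile",
    "physician": "doctor",
    "doctor": "doctor",
}


def normalize(term):
    """Return the preferred term for a label, or the label itself if it is uncontrolled."""
    key = term.strip().lower()
    return PREFERRED_TERM.get(key, key)


def build_index(labelled_items):
    """File each item under preferred terms so that synonyms are not scattered."""
    index = {}
    for item_id, labels in labelled_items:
        for label in labels:
            index.setdefault(normalize(label), set()).add(item_id)
    return index


# Made-up items, each labeled with a different synonym.
items = [
    ("doc1", ["Car"]),
    ("doc2", ["automobile"]),
    ("doc3", ["motorcar", "physician"]),
]
index = build_index(items)
print(sorted(index["automobile"]))  # ['doc1', 'doc2', 'doc3']
```

Without the `PREFERRED_TERM` table the three documents would be filed under three different headings, which is exactly the scattering described above.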
Semantic ambiguity arises when a term or phrase has more than one meaning, and it is very likely the subtlest of the three linguistic considerations mentioned. Terms or phrases that are written the same way but have different meanings are called homographs, and the condition is known as homography. Terms or phrases that sound alike but have different meanings are called homonyms. However, the term homonymy is used with reference to both homographs and homonyms, which is itself a clear-cut example of semantic ambiguity.
当一个术语或词组具有不止一个意思时就会出现语义含糊,这个是上面提及的三个语言现象中的最微妙的。那些写法一样但意思不同的术语或词组就叫做同形异义词,这种现象就叫做同形异义性。读音相同但意思不同的术语或词组就叫做同音异义字。然而,同音异义性(homonymy)这个术语既用于同形异义词,也用于同音异义字,这本身就是语义含糊的一个清晰例子。
Subject heading lists and thesauri are similar in that they both usually consist of alphabetically arranged terms, cross-references, and notes to be used in indexing or searching a corpus of documents. Some specialists in documentation use the terms authority list, subject heading list, and thesaurus interchangeably because of the evident similarity of design and purpose. However, to do so runs roughshod over distinctions that are not trivial. Most, if not all, modern thesauri treat only a subset of knowledge, namely, a particular discipline or field of study. Furthermore, the preparation of thesauri generally involves sorting out synonyms and homographs that actually appear in the text of the documents being indexed, rather than trying to anticipate the concepts that will be encountered. In short, traditional subject heading lists may have a philosophical basis rivaling that of formal classification, whereas thesauri are usually developed in a manner that is primarily inductive and pragmatic.
主题词表和类词词典是相似的,因为他们通常都包括按字母顺序排序的术语、交叉参考和那些在索引或搜索文献汇编中使用的注释。一些文献学专家互换使用规范表、主题词表和类词词典这三个术语,因为设计和目标有着明显的相似性。然而,这样做会抹杀他们间的重要区别。大多数(如果不是全部的话)现代类词词典只处理知识的一个子集,也就是,一个特定的学科或研究领域。然而,类词词典的准备通常包括挑出在被索引的文献正文中实际出现的同义词和同形异义词,而不是尝试着预测可能遇到的概念。简而言之,传统的主题词表可能有一个可与正式分类法相匹敌的哲学基础,而类词词典通常采用一种以归纳和实用为主的方式编写。
By far the most popular and successful technique using little or no vocabulary control is keyword indexing by computer. Typically, the data (usually titles) are put into machine-readable form and input to a suitably programmed computer. The program may identify a word as a string of characters between two blanks and then compare each of these words with an internally stored list of “stop” words. Stop words are those that can be presumed to have no indexing value, such as articles, prepositions, conjunctions, and the like. All words matching a stop word are rejected, while all others are retained.
到目前为止,最流行和最成功的技术是利用计算机的关键字索引,这技术使用了一点或没有使用词汇控制。代表性的是,数据(通常是标题)被翻译成机器可读的形式然后输入一个通过正确方式程序化的计算机。这个程序可能识别一个介于两个空格之间作为字符串的词,然后将每一个词和内部存储的停止词列表中的词进行比较。停止词就是那些可以假定为没有任何索引意义的词,例如冠词、介词、连词等等。所有与停止词匹配的词都被拒绝,而其他的词将被保留。
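The keyword-indexing procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the program the text refers to: the stop list, the sample titles, and the function names are invented here, and stripping punctuation and lowercasing are small additions beyond the "string of characters between two blanks" rule.

```python
# Assumed stop list for the example: articles, prepositions, and conjunctions.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "for", "to", "with"}


def keywords(title):
    """Split a title on blanks and keep only the words that are not stop words."""
    kept = []
    for raw in title.split():
        # Assumption: surrounding punctuation is removed and case is ignored.
        word = raw.strip(".,;:!?\"'()").lower()
        if word and word not in STOP_WORDS:
            kept.append(word)
    return kept


def keyword_index(titles):
    """Map each retained word to the numbers of the titles in which it occurs."""
    index = {}
    for number, title in enumerate(titles, start=1):
        for word in set(keywords(title)):
            index.setdefault(word, []).append(number)
    return index


# Made-up titles standing in for the machine-readable input.
titles = ["The Design of Indexing Systems", "Vocabulary Control in Indexing"]
index = keyword_index(titles)
print(keywords(titles[0]))  # ['design', 'indexing', 'systems']
print(index["indexing"])    # [1, 2]
```

Because no preferred terms are chosen, the index reflects whatever words the authors happened to use, which is what "little or no vocabulary control" means in practice.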
第七课
Information Acquisition and Recording
Humans receive information with their senses: sounds through hearing; images and text through sight; shape, temperature, and affection through touch; and odors through smell. To interpret the signals received from the senses, humans have developed and learned complex systems of languages consisting of "alphabets" of symbols and stimuli and the associated rules of usage. This has enabled them to recognize the objects they see, understand the messages they read or hear, and comprehend the signs received through the tactile and olfactory senses.
人类通过他们的感官来获取信息:通过听获取声音信息,看获得图像和文本信息,触摸获得形状、温度和情感信息,闻获得气味。为了解释通过感官获得的信号,人类发明和学习了复杂的语言系统,这些语言系统由符号和刺激构成的“字母表”以及相关的使用规则组成。这就使得他们能够分辨出他们看到的物体,理解他们读到或听到的信息,理解通过触觉和嗅觉感官获得的信息。
For information to be communicated broadly, it needs to be stored external to human memory; accumulation of human experience, knowledge, and learning would be severely limited without such storage. Storage of information external to memory made necessary the development of writing systems.
为了信息能被广泛交流,它就需要被存储到人的记忆外。人的经验、知识和学习的积累如果没有存储在人的记忆之外就会受到很大程度的限制。信息存储在人记忆之外使得书写系统的发展成为必要。
Civilization can be traced to the time when humans began to associate abstract shapes with concepts and with the sounds of speech that represented them. Early recorded representations were those of visually perceived objects and events, as, for example, the animals and activities depicted in Paleolithic cave drawings. The evolution of writing systems proceeded through the early development of pictographic language, in which a symbol would represent an entire concept. Such symbols would go through many metamorphoses of shape in which the resemblance between each symbol and the object it stood for gradually disappeared, but its semantic meaning would become more precise. As the conceptual world of humankind grew larger, the symbols, called ideographs, grew in number. Modern Chinese, a present-day result of this evolutionary direction of a pictographic writing system, has upward of 50,000 ideographs.
人类文明可以追溯到人类开始将抽象形体与概念和代表这些概念的话语声音联系在一起的时候。早期记录下来的形体表征是那些眼睛看得见的物体和事件,例如旧石器时代洞穴绘画中描述的动物和活动。书写体系的演变延续了象形文字表达方式的早期发展。在象形文字中,一个符号表达一个完整的概念。这些符号将经历许多形体的变化。在变化中每个象征符号和它所代表的实物之间的相似之处逐渐消失,但其语言的含义将变得更准确。随着人类理念范围的扩大,这些象征符号被人们称为表意文字,其数量也大大增加。现代汉语就是象形文字书写系统逐渐演变过程的现实结果,其表意词有5万多个。
The versatility of modern information systems stems from their ability to represent information electronically as digital signals and to manipulate it automatically at exceedingly high speeds. Information is stored in binary devices, which are the basic components of digital technology. Because these devices exist only in one of two states, information is represented in them either as the absence or the presence of energy (electric pulse). The two states of binary devices are conveniently designated by the binary digits, or bits, zero (0) and one (1).
现代信息系统的多样性与灵活性来源于他们利用电子化的数字信号表达信息和以非常高的速度自动管理控制信息的能力。信息存储在二进制的器件里。这些器件是数字技术的基本组成部分。由于这些器件只处于两种状态中的一种状态下,因此表达信息的形式有两种,一种是无能源(电子脉冲)状态,另一种是有能源(电子脉冲)状态。用二进制数(或比特),即0和1,就可以很方便地指明二进制器件的两种状态。
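As a small illustration of how text ends up as zeros and ones, the sketch below converts a string to bits and back. The passage does not name a character encoding, so the use of UTF-8 is an assumption made for the example, as are the function names `to_bits` and `from_bits`.

```python
def to_bits(text):
    """Encode text as bytes (UTF-8 assumed) and write each byte as eight binary digits."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))


def from_bits(bits):
    """Reverse to_bits: turn groups of eight binary digits back into text."""
    data = bytes(int(group, 2) for group in bits.split())
    return data.decode("utf-8")


encoded = to_bits("Hi")
print(encoded)             # 01001000 01101001
print(from_bits(encoded))  # Hi
```

Each 0 or 1 in the output corresponds to one binary device being in its "absent" or "present" state.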
Optical discs fall into two classes: (1) prerecorded, exemplified by the compact disc read-only memory (CD-ROM), from which data can be read but not modified; and (2) writable/erasable, on which data can be both recorded and overwritten. The latter variety is the full functional equivalent of the magnetic disk. It employs magneto-optic technology: the erasable medium is magnetic. Optical storage represents a considerable improvement over magnetic disks in terms of superior recording capability (it accommodates storage of voice, text, graphics, and video), of storage capacity (2,000,000,000 characters on a 12-inch disc), of portability, and of low cost; in fact, it has the lowest per-bit storage cost of any digital storage medium.
光盘被分为两种类型,一是预录光盘,例如只读光盘(CD-ROM),这种光盘中的数据只能被读取而不能更改;二是可写/可擦除光盘,这种光盘上的数据既可以被记录也可以被改写。后一种光盘是磁盘的全功能等价物。它采用磁光技术:可擦除的媒介是有磁性的。就出众的记录能力(它可以存储声音、文本、图像和视频)、存储容量(12英寸的光盘可以容纳20亿个字符)、便携性和低成本而言,光存储相对于磁盘有了相当大的改进;事实上,它的每比特存储成本是所有数字存储媒介中最低的。
第八课
Information Analysis and Storage
Digitally stored data may represent alphanumeric, image, or audio information, each requiring different techniques for analyzing content. The objectives of content analysis of alphanumeric information are twofold. At the simple level the aim is to describe the record (document) in terms of some of its properties or characteristics so that the record can be located or assigned to a category of similar records. At a more complex level the objective is to represent the meaning or purport of the document so that its content can be “understood” and possibly further manipulated by machine in a cognitive sense.
数码存储的数据可能是字母和数字的混合字符、图像或者声音信息,每一种都需要用不同的技术来进行内容分析。对字母数字混合字符信息进行内容分析的目的有两层。简单一层的目的是描述这个文件的有关特性,以便对该文件定位或将相似文件划归为一类。复杂的一层的目的是表述文件的意义或含义,以便它的内容从认知角度能广泛的被机器理解和处理,从而得到进一步的控制和管理。
Digital information is stored in complex patterns that make it feasible to address and operate on even the smallest element of symbolic expression, as well as on larger strings such as words or sentences and on images and sound.
数字信息存储在复杂的模式之中,这就使在符号表达式最小元素的定位和操作 成为可能,也使得在较大的字符串例如词句以及图像声音的定位和操作成为可能。
From the viewpoint of digital information storage, it is useful to distinguish between “structured” data, such as inventories of objects that can be represented by short symbol strings and numbers, and “unstructured” data, such as the natural-language text of documents or pictorial images. The principal objective of all storage structures is to facilitate the processing of data elements based on their relationships; the structures thus vary with the type of relationship they represent. The choice of a particular storage structure is governed by the relevance of the relationship it allows to be represented to the information-processing requirements of the task or system at hand.
从数字信息存储的角度来看,区分结构化和非结构化数据是十分有用的,可以通过短字符串和数字来表示的物品清单就是结构化数据,文献的自然语言文本或绘画图形是非结构化数据。所有存储结构的主要目标是促进基于他们之间关系的数据元素的处理。这些结构会随着他们表示的关系类型的变化而变化。特定存储结构的选择取决于它所能表示的关系与手头任务或系统的信息处理需求之间的相关性。
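The difference between structured and unstructured data can be made concrete with a short sketch. Everything here is invented for illustration: the `InventoryItem` fields, the part numbers, and the sample sentence. The point is only that a structured record can be queried by field, whereas natural-language text has to be searched by its content.

```python
from dataclasses import dataclass


@dataclass
class InventoryItem:
    """A structured record: short symbol strings and numbers in named fields."""
    part_number: str
    description: str
    quantity: int


# Hypothetical inventory used only for this example.
inventory = [
    InventoryItem("A-100", "bolt", 250),
    InventoryItem("A-200", "washer", 0),
]

# Structured data: the relationship "quantity of a part" is explicit, so it can be queried directly.
out_of_stock = [item.part_number for item in inventory if item.quantity == 0]

# Unstructured data: the same fact is buried in free text and must be found by searching it.
document = "The washer shipment was delayed, so no washers are in stock."
mentions = document.lower().count("washer")

print(out_of_stock)  # ['A-200']
print(mentions)      # 2
```

The record structure suits the inventory because the task is to process items by their attributes; the free text would need a different storage structure, such as a word index, to be processed efficiently.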
The feasibility of storing large volumes of full text on an economical medium (the digital optical disc) has renewed interest in the study of storage structures that permit more powerful retrieval and processing techniques to operate on cognitive entities other than words, to facilitate more extensive semantic content and context analysis, and to organize text conceptually into logical units rather than those dictated by printing conventions.
把大量的全文文本存储在一个经济的媒介上(光盘)的可行性重新激发了人们研究存储结构的兴趣。这些存储结构允许强大的检索和处理技术手段来对文字以外的可认知实体进行操作,促进更全面的语义内容和上下文的分析,而且按概念把正文组织成逻辑单元,而不是按印刷惯例所要求的那样去组织。
第九课
Information Display and Dissemination
Print. Modern society continues to be dominated by printed information. The convenience and portability of print on paper make it difficult to imagine the paperless world that some have predicted. The generation of paper print has changed considerably, however. Although manual typesetting is still practiced for artwork, in special situations, and in some developing countries, electronic means of composing pages for subsequent reproduction by photoduplication and other methods have become commonplace.
打印。现代社会继续被打印的信息所支配。将内容打印在纸上的便利性和便携性使得有些人已经预测了的无纸化世界变得很难想象。然而一代代打印纸有着相当大的改变。尽管人工排版仍然在一定特殊的情景下应用于艺术品中,而且在一些发展中国家,通过照片复制和一些其他方法后来的复制组成页面的电子方式已经变得习以为常。
The process of recording information by handwriting was obviously laborious and required the dedication of the likes of Egyptian scribes or monks in monasteries around the world. It was only after mechanical means of reproducing writing were invented that information records could be duplicated more efficiently and economically.
用手写记录信息的过程很明显是相当费力的,并且需要像古埃及文牍人员或世界各地寺庙里的僧侣一样的无私奉献精神。只有在机械复制文件的方法出现之后,信息才能够更有效、更经济地复制出来。
Printing from movable type was also invented in China (in the mid-11th century AD). There and in the book-making industry of Korea, where the method was applied more extensively during the 15th century, the ideographic type was made initially of baked clay and wood and later of metal. The large number of typefaces required for pictographic text composition has continued to handicap printing in the Far East up to the present time.
活字印刷也是中国发明的(在公元11世纪中叶)。在15世纪中国和韩国的造书市场中,活字印刷术被更为广泛地应用,表意类型最初是由焙土、木头然后是金属制成的。
While the volume of information issued in the form of printed matter continues unabated, the electronic publishing industry has begun to disseminate information in digital form. The development of the microcomputer has provided the main impetus. Not only is such a system almost as versatile and efficient as larger computers in information retrieval, but it carries out this operation relatively economically. Certain types of information media lend themselves particularly well to distribution in digital form. These include catalogs, handbooks, indexes, databases, and reference materials designed to be consulted rather than read in toto. Computer software also is well suited for distribution via electronic publishing.
一方面以印刷品的形式出版的信息量继续增长,另一方面电子印刷业也开始用数字形式传播信息。微型计算机的发展提供了主要推动力。在信息检索方面,这一系统不仅几乎和大型计算机一样用途广泛且高效,而且运作起来非常经济。某些形式的信息媒体非常适合数码形式的信息传播方式,这些信息媒体包括目录、手册、索引、数据库和其他供查询而不是供全文阅读而设计的参考资料。计算机软件同样十分适合以电子出版的方式来进行发行。
第十课
Basic Concepts of Information Retrieval Systems
The concept of information retrieval presupposes that there are some documents or records containing information that have been organized in an order suitable for easy retrieval. The documents or records we are concerned with contain bibliographic information, which is quite different from other kinds of information or data. We may take a simple example. If we have a database of information pertaining to an office, or a supermarket, all we have are the different kinds of records and related facts, like names of employees, their positions, salary, and so on. The retrieval system here is designed to search for and retrieve specific facts or data, like the salary of a particular manager, or the price of a perfume, and so on.
信息检索的概念是以有很多文献和记录包含信息为先决条件的,为了简单检索这些信息按照合适的顺序进行组织。这些我们关心的文献和记录包含著书目录的信息,而这些信息与其他种类的信息和数据具有很大的不同。我们可以举个简单的例子。如果我们有一个关于办公或超市信息的数据库,所有我们拥有的是不同种类的记录和相关事实,例如雇员的名字、职位、工资等等。这里的检索系统是设计来搜索和检索具体事实或数据的,例如一个特定管理者的工资,或香水价格等。
The major objective of a bibliographic information retrieval system, however, is to retrieve bibliographic details of those items containing the user's required information. The database here comprises bibliographic records of stored documents. It may also contain abstracts or full texts of documents, like newspaper articles, handbooks, dictionaries, encyclopedias, legal documents, statistics, and so on. In such a situation, then, the system is designed to retrieve the actual text that, supposedly, would satisfy the information requirement(s) of the user concerned. This is called a full-text retrieval system, because it deals with the actual text of documents and finally retrieves the actual information (in the form of text).
然而,著书目录信息检索系统的主要目标是检索那些包含使用者需要的信息的术语的著书目录细节。这里的数据库包含存储文献的著书目录记录。它可能也包含文献的摘要或全文,例如报纸文章、手册、字典、百科全书、法律文献、统计学等。在这样一个情境下,按照推测系统是被设计用来检索那些能满足用户需求的实际文章。这就叫做全文检索系统,因为它处理文献的实际正文,然后检索实际信息。
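The distinction between a bibliographic system and a full-text system can be sketched with a toy collection. The records, author names, and function names below are entirely made up, and simple substring matching stands in for whatever search technique a real system would use. Both functions find the same documents; they differ only in what they hand back to the user.

```python
# Hypothetical stored documents: bibliographic details plus the text itself.
RECORDS = [
    {
        "author": "Author A",
        "title": "Vocabulary control in indexing",
        "text": "Synonyms scatter entries across an index.",
    },
    {
        "author": "Author B",
        "title": "Optical storage media",
        "text": "Optical discs store text, graphics, and video.",
    },
]


def matches(record, term):
    """Assumed matching rule: the term occurs in the title or the text, ignoring case."""
    haystack = (record["title"] + " " + record["text"]).lower()
    return term.lower() in haystack


def bibliographic_search(term):
    """Return only the bibliographic details (a citation) of each matching document."""
    return [f'{r["author"]}. {r["title"]}.' for r in RECORDS if matches(r, term)]


def full_text_search(term):
    """Return the actual text of each matching document."""
    return [r["text"] for r in RECORDS if matches(r, term)]


print(bibliographic_search("optical"))  # ['Author B. Optical storage media.']
print(full_text_search("synonyms"))     # ['Synonyms scatter entries across an index.']
```

With the bibliographic result the user still has to obtain and read the document; with the full-text result the system delivers the text itself.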
Whatever the nature of the database (bibliographic or full-text), the system presupposes that there is a group of users for whom the system is designed. Users are considered to have certain queries or information needs, and when they put forward their requirements to the system, the latter should be able to provide the necessary bibliographic references of those documents containing the required information, or the actual text in the case of a full-text retrieval system. Alternative models of (knowledge-based) information retrieval seek to provide the user with the information directly rather than just the citations, the abstract, or the full text.
无论数据库的性质是书目型还是全文型,系统都预先假定存在一群使用者,系统正是为他们设计的。使用者被认为有某些问题或信息需求,当他们对系统提出需求时,系统应该能提供那些包含所需信息的文献的必要书目参考,或者在全文检索系统的情况下提供实际文本。(基于知识的)信息检索的替代模型力求直接为使用者提供信息,而不仅仅是引文、摘要或全文。