The ggbaker.com encyclopedia
 

 

4. Indexing and retrieval

The indexing system adopted for the document management system will dictate how much time must be spent preparing documents for input. Documents in electronic formats usually contain sufficient machine-readable information for filing and retrieval, but paper may have to be sorted into batches and classified before input.

Creating an index from scanned input

Simple scanning of a document produces a digital image, which is really just a series of dots. To enable automated indexing it is usually necessary to employ optical character recognition (OCR), mark recognition or barcode reading to capture all or a selected portion of each document. This can be done as an integral part of the scan or as a subsequent operation to produce machine-readable data about the document. An important advantage of this process is that it creates a secure digital image, which is difficult to edit and suitable for use in evidence, while extracting sufficient metadata to enable indexing and retrieval, automatic circulation to users, retention and eventual destruction. Inexpensive document scanners are not normally equipped with recognition or barcode facilities as standard, but these may be available as optional extras. The data sheets in the commercial section of this site attempt to indicate what features each machine offers as standard and as options.

Indexing is a complex, time-consuming and expensive necessity that is too often underestimated when designing and scheduling systems and estimating costs. Too little indexing renders potentially valuable information useless, because it cannot be accessed as and when required, but over-elaborate indexing is an inexcusable waste of time and money.

One of the advantages of thorough preparation for the introduction of an electronic document management (EDM) system is that all potential users, and the way in which they need access, will have been identified for each document type. Although there are many levels of sophistication available, most indexing systems fall into one of three broad categories.

Typical index types

Simple (flat-file) indexes rely on the card index concept. Each document is allocated a unique index record and the terms by which it will be retrieved are listed on that record. A further document with similar characteristics gets its own separate record, even where most of the terms are identical. This can be effective for small applications but, because information common to many records is entered again and again, the index becomes bulky and is unsuitable for large systems.
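
The card index idea can be illustrated with a short Python sketch. It is only an illustration under assumed data: the field names (doc_id, customer, doc_type, date), the sample records and the helper names find and repeated_entries are hypothetical and do not come from any particular product.

```python
# Hypothetical flat-file index: one self-contained record per document,
# like one card per document in a card index. All names and values are
# invented for illustration.
flat_index = [
    {"doc_id": 1, "customer": "Acme Engineering Ltd", "doc_type": "invoice", "date": "2011-03-01"},
    {"doc_id": 2, "customer": "Acme Engineering Ltd", "doc_type": "invoice", "date": "2011-04-01"},
    {"doc_id": 3, "customer": "Acme Engineering Ltd", "doc_type": "credit note", "date": "2011-04-15"},
]


def find(index, **terms):
    """Return every record whose fields match all of the given terms."""
    return [rec for rec in index if all(rec.get(k) == v for k, v in terms.items())]


def repeated_entries(index, field):
    """Count how many times a value was typed in again after its first use."""
    values = [rec[field] for rec in index]
    return len(values) - len(set(values))


invoices = find(flat_index, doc_type="invoice")
print([rec["doc_id"] for rec in invoices])
print(repeated_entries(flat_index, "customer"))
```

Retrieval is a simple scan of the records, but the customer name has been keyed in for every document, which is the duplication described above.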

Relational databases have been developed to overcome this unnecessary duplication of identical data. Michael Halvorson and Michael Young of Microsoft express the benefits very clearly. "Structuring your data in a relational format has a number of advantages. You'll save considerable time by not having to enter the same data again and again across many records. Your database will be smaller, often a fraction of the size of a flat-file database, saving space on your system and making the database more portable if you want to share it with others. Data entry errors will be greatly reduced - how many times can you type Thermodynamics Theory into a Class Name field without error? If the repeated data is stored in a related table, you need to enter the correct information just once; then, in the original table, you enter only the identifier of the information - usually a short numeric or alphanumeric code - each time the repeated data occurs. What you do need to understand is that a 'field' is a category of information, an 'entry' is the information that goes into a field for a single record, a 'record' consists of the related entries for an individual item in the database (and fills up a row within a table), and you can set up relationships between separate tables so that you need enter repeated information only once." We think that this is an excellent description.
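
As a minimal sketch of the relational approach described in the quotation, the following Python example uses the standard sqlite3 module with an in-memory database. The table names (classes, documents), the column names and the sample titles are assumptions made for illustration; only the class name "Thermodynamics Theory" is taken from the quotation.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE classes (
        class_id   INTEGER PRIMARY KEY,
        class_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE documents (
        doc_id   INTEGER PRIMARY KEY,
        title    TEXT NOT NULL,
        class_id INTEGER NOT NULL REFERENCES classes(class_id)
    );
""")

# The repeated text is entered once, in the related table...
conn.execute(
    "INSERT INTO classes (class_id, class_name) VALUES (1, 'Thermodynamics Theory')"
)

# ...and each document record stores only the short identifier.
conn.executemany(
    "INSERT INTO documents (title, class_id) VALUES (?, ?)",
    [("Lecture notes", 1), ("Exam paper", 1), ("Reading list", 1)],
)

# A join puts the full class name back alongside each document.
rows = conn.execute("""
    SELECT d.title, c.class_name
    FROM documents AS d
    JOIN classes AS c ON c.class_id = d.class_id
    ORDER BY d.doc_id
""").fetchall()
print(rows)
```

The class name exists in one row only, so a spelling correction is made once and every linked document reflects it.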

The third broad type of indexing option is hypertext, which is familiar to anyone using the World Wide Web. It is best suited to research and knowledge management applications, where the requirement is not so much for a specific record as for any or all records containing relevant information. It allows users to browse through hundreds or even millions of documents, hopping from one to another by clicking an underlined hyperlink. Its effectiveness depends on the quality of the editorial effort employed in setting up the links; for many applications that has proved difficult to automate.

Metadata: Every record in any system needs two types of data linked to it: data used to manage and control the document within the system, and data employed by users for search and retrieval purposes. Control may involve automatic distribution of new documents to those known to need them, the introduction of codes to limit access to authorised users only, methods of logging each reference to the document, maintenance of audit trails, data linking each stage in the development or amendment of a document, automatic movement from one storage medium to another at pre-arranged points in the document life cycle and possible subsequent automatic destruction. This control data is normally entered into fixed-length fields and held in a relational database.
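
The split between control data and retrieval data can be sketched as a single Python record type. This is a hypothetical illustration, not the structure of any real EDM product: the class name DocumentRecord, its fields, the method names and the sample values are all assumptions, and only three of the control functions listed above (access limits, logging of each reference, and retention leading to destruction) are modelled.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class DocumentRecord:
    """Hypothetical record holding both kinds of metadata for one document."""
    doc_id: str
    retrieval: dict          # retrieval data: the terms users search on
    authorised_users: set    # control data: who may open the document
    created: date            # control data: start of the life cycle
    retention_days: int      # control data: how long the document is kept
    audit_trail: list = field(default_factory=list)  # log of every reference

    def open(self, user: str, when: date) -> bool:
        """Check access and log the attempt, whether allowed or refused."""
        allowed = user in self.authorised_users
        self.audit_trail.append((when, user, "opened" if allowed else "refused"))
        return allowed

    def due_for_destruction(self, today: date) -> bool:
        """True once the retention period has fully elapsed."""
        return today >= self.created + timedelta(days=self.retention_days)


record = DocumentRecord(
    doc_id="INV-0001",
    retrieval={"customer": "Acme Engineering Ltd", "total": "120.00"},
    authorised_users={"alice"},
    created=date(2011, 1, 3),
    retention_days=365,
)
print(record.open("alice", date(2011, 2, 1)))
print(record.open("bob", date(2011, 2, 2)))
print(record.due_for_destruction(date(2011, 12, 31)))
print(len(record.audit_trail))
```

In a production system these fields would normally live in database tables, as the paragraph above notes, rather than in an in-memory object.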

Data employed for search and retrieval can be held in the same database if the terms that users will employ for retrieval can be identified in advance. Invoices, for example, can normally be indexed by a limited number of fields such as date, number, customer name or ID and total amount. Research and knowledge type documents are not so easy to classify, because it is difficult to predict which part of the content will be of future value and how the information will be requested; in such cases free text searching techniques can be employed. The concept relies on searching keywords, an abstract of the document, or its complete text. In the case of digital documents this is fairly simple to organise, but when input arrives in the form of scanned pages, recognition software must be employed to convert the page images into coded text before free text searches can be run.
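
One common way to implement free text searching is an inverted index, which maps each word to the documents that contain it. The Python sketch below assumes that the text has already been obtained, either from a digital document or from recognition software; the function names (tokenize, build_index, search) and the sample documents are hypothetical, and the technique is offered as an example rather than as the method any given product uses.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Sample text, invented for illustration.
documents = {
    "doc1": "Retention schedule for scanned invoices",
    "doc2": "Research notes on optical character recognition",
    "doc3": "Scanned research papers awaiting recognition",
}
index = build_index(documents)
print(sorted(search(index, "research")))
print(sorted(search(index, "scanned recognition")))
```

The same index could be built from keywords or an abstract instead of the complete text; indexing less text gives a smaller index at the cost of missing terms that appear only in the body.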

The information above is greatly simplified, and expert guidance is essential to ensure that the most appropriate method of control and indexing is adopted. The important point to note is that indexing requirements will greatly influence the choice of suitable electronic document management software and hardware, so they must be fully researched before a system is selected or a scanner is purchased.


 


Webmaster: Gerald Baker - gerald@ggbaker.co.uk | Last updated 3/1/12 | © G G Baker & Associates, 2010