4. Indexing and retrieval
The indexing system adopted for the document management system dictates
how much time must be spent preparing documents for input. Documents in
electronic formats usually contain sufficient machine-readable information
for filing and retrieval, but paper may have to be sorted into batches and
classified prior to input.
Creating an index from scanned input
Simple scanning of a document produces a
digital image that is really just a series of dots. To enable automated
indexing it is usually necessary to employ optical character recognition
(OCR), mark recognition or barcode reading to capture all or a selected
portion of each document. This can be done as an integral part of the scan or
as a subsequent operation to produce machine-readable data about the document.
An important advantage of this process is that it creates a secure digital
image, which is difficult to edit and suitable for use in evidence, while
extracting enough metadata to support indexing and retrieval, automatic
circulation to users, retention and eventual destruction. Inexpensive
document scanners are not normally equipped with recognition or barcode
facilities as standard, but these may be available as optional extras. The
data sheets in the commercial section of this site attempt to indicate what
features each machine offers as standard and as options.
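As a rough illustration of how recognised text can be turned into index data, the sketch below pulls two fields out of a block of OCR output. It assumes the recognition step has already produced plain text; the field labels ("Invoice No:", "Date:"), the patterns and the function name extract_index_fields are hypothetical and would need to be matched to the layout of the real documents.

```python
import re

# Hypothetical patterns: real documents need patterns matched to their layout.
INVOICE_NO = re.compile(r"Invoice\s+No\.?\s*:\s*([A-Z0-9-]+)")
DATE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b")


def extract_index_fields(ocr_text):
    """Return whichever index fields can be found in the recognised text."""
    fields = {}
    match = INVOICE_NO.search(ocr_text)
    if match:
        fields["invoice_number"] = match.group(1)
    match = DATE.search(ocr_text)
    if match:
        fields["date"] = match.group(1)
    return fields


# Sample text standing in for the output of an OCR pass over one page.
sample = "ACME Ltd\nInvoice No: INV-00123\nDate: 14/03/2001\nTotal: 250.00"
print(extract_index_fields(sample))
```

A page on which neither pattern is found yields an empty dictionary, which a real system would probably route to manual indexing rather than file unindexed.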
Indexing is a complex, time-consuming and
expensive necessity that is too often underestimated when designing and
scheduling systems and estimating costs. Too little indexing renders
potentially valuable information useless, because it cannot be accessed as
and when required, but over-elaborate indexing is an inexcusable waste of
time and money.
One of the advantages of thorough preparation for
the introduction of an EDM system is that all potential users, and the way
in which they need access, will have been identified for each document type.
Although there are many levels of sophistication available, most indexing
systems will fall into one of three broad categories.
Typical index types
Simple (flat-file) indexes rely on the card index concept.
Each document is allocated a unique index record and the terms by which it
will be retrieved are listed on that record. An additional document with
similar characteristics will also have a unique record. This can be
effective for small applications but, because information common to many
records is entered again and again, it is bulky and unsuitable for large
systems.
Relational databases have been developed to overcome this
unnecessary duplication of identical data. Michael Halvorson and Michael
Young of Microsoft express the benefits very clearly: "Structuring your data
in a relational format has a number of advantages. You'll save considerable
time by not having to enter the same data again and again across many
records. Your database will be smaller, often a fraction of the size of a
flat-file database, saving space on your system and making the database more
portable if you want to share it with others. Data entry errors will be
greatly reduced - how many times can you type Thermodynamics Theory into a
Class Name field without error? If the repeated data is stored in a
related table, you need to enter the correct information just once; then, in
the original table, you enter only the identifier of the information -
usually a short numeric or alphanumeric code - each time the repeated data
occurs. What you do need to understand is that a "field" is a category of
information, an "entry" is the information that goes into a field for a
single record, a "record" consists of the related entries for an individual
item in the database (and fills up a row within a table), and you can set up
relationships between separate tables so that you need enter repeated
information only once." We think that this is an excellent description.
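The idea can be sketched with Python's built-in sqlite3 module. The table and column names below (subject, document, subject_id) are illustrative assumptions, not part of any particular EDM product: the repeated text is stored once in a related table, and each document record carries only its short numeric identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The repeated information is entered once here...
    CREATE TABLE subject (
        subject_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    -- ...and each document record stores only its identifier.
    CREATE TABLE document (
        document_id INTEGER PRIMARY KEY,
        title       TEXT NOT NULL,
        subject_id  INTEGER NOT NULL REFERENCES subject(subject_id)
    );
""")
conn.execute("INSERT INTO subject VALUES (1, 'Thermodynamics Theory')")
conn.executemany(
    "INSERT INTO document (title, subject_id) VALUES (?, ?)",
    [("Lecture notes", 1), ("Reading list", 1), ("Exam paper", 1)],
)

# A join rebuilds the flat, card-index view whenever it is needed.
rows = conn.execute("""
    SELECT d.title, s.name
    FROM document AS d
    JOIN subject AS s ON s.subject_id = d.subject_id
    ORDER BY d.document_id
""").fetchall()
for title, name in rows:
    print(title, "|", name)
```

In a flat-file index the subject name would be typed into all three records; here a spelling correction needs to be made in one row only.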
The third broad type of indexing option is hypertext,
which is familiar to anyone using the World Wide Web. It is best suited to
research and knowledge management applications where the requirement is not
so much for a specific record as for any or all records containing relevant
information. It allows users to browse through hundreds or even millions of
documents, hopping from one to another by clicking an underlined hyperlink.
Its effectiveness depends on the quality of the editorial effort employed in
setting up the links; for many applications that has proved difficult to
automate.
Metadata
Every record in any system needs two types of data linked to it: data used to manage and
control the document within the system and data employed by users for search
and retrieval purposes. Control may involve automatic distribution of new
documents to those known to need them, the introduction of codes to limit
access to authorised users only, methods of logging each reference to
the document, maintenance of audit trails, data linking each stage in the
development or amendment of a document, automatic movement from one storage
medium to another at pre-arranged points in the document life cycle and
possible subsequent automatic destruction. This data is normally entered
into fixed-length fields and held in a relational database.
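A minimal sketch of such control data is shown below, covering only three of the functions listed: access restriction, an audit trail and a destruction date. The class and field names (ControlRecord, authorised_users, destroy_after) are hypothetical, and a production system would hold this in database tables rather than in memory.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ControlRecord:
    """Control metadata for one document (illustrative fields only)."""
    document_id: str
    authorised_users: set
    destroy_after: date
    audit_trail: list = field(default_factory=list)

    def request_access(self, user, today):
        """Check the user against the access list and log the attempt."""
        allowed = user in self.authorised_users
        outcome = "granted" if allowed else "refused"
        self.audit_trail.append((today.isoformat(), user, outcome))
        return allowed

    def due_for_destruction(self, today):
        """True once the pre-arranged retention period has expired."""
        return today >= self.destroy_after


record = ControlRecord("DOC-0001", {"alice"}, date(2030, 1, 1))
print(record.request_access("alice", date(2024, 5, 1)))
print(record.request_access("bob", date(2024, 5, 1)))
print(record.due_for_destruction(date(2031, 1, 1)))
print(record.audit_trail)
```

Refused requests are logged as well as granted ones, since an audit trail that records only successful access is of limited use.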
Data employed for search and retrieval can also be held in
the same database if the terms that users will employ for retrieval can be
identified. Invoices, for example, can normally be indexed by a limited
number of fields such as date, number, customer name or ID, total amount
etc. Research and knowledge-type documents are not so easy to classify
because it is difficult to predict which part of the content will be of
future value and how the information will be requested; in such cases
free-text searching techniques can be employed. The concept relies on
searching keywords, an abstract of the document, or its complete text. In the
case of digital documents this is fairly simple to organise, but when input
arrives in the form of scanned pages, recognition software must be employed to
convert the page images into coded text before free-text searches are possible.
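One common way to support free-text searching is an inverted index, which maps each word to the documents that contain it. The sketch below is a simplified illustration under that assumption, not a description of any particular product; the document identifiers and texts are invented, and there is no stemming, so "scanner" and "scanned" are treated as different words.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented sample texts; these could be keywords, abstracts or full text.
documents = {
    "memo-1": "Retention schedule for scanned invoices",
    "memo-2": "Scanner maintenance schedule",
    "report-7": "Invoices outstanding at year end",
}
index = build_index(documents)
print(sorted(search(index, "schedule")))
print(sorted(search(index, "scanned invoices")))
```

The same functions work whether the indexed text is a keyword list, an abstract or the complete document; only the size of the index changes.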
The information above is greatly simplified, and expert
guidance is essential to ensure that the most appropriate method of control
and indexing is adopted. The important point to note is that indexing
requirements will greatly influence the choice of suitable electronic
document management software and hardware, so they must be fully
researched before a system is selected or a scanner is purchased.