Documentary databases - Associatie Antwerpen

Documentary databases
with introduction to the ISIS-database
technology.
An introduction
by
Egbert de Smet
Univ. of Antwerp, Belgium
Summary
• Relational database technology is not the best
solution for all types of information
• Another model ‘NO-SQL’ is becoming more
popular, esp. in web-environments
• ISIS uses a non-relational model for flexibility,
but is mostly suited for semi-structured
information with semi-relational capabilities
• An ISIS database uses an MST with XRF-index
and an Inverted File with postings (addresses) for
searching, plus a powerful ‘Formatting Language’
9/04/2017
New trends in databases (1)
• No-SQL (non-relational) databases :
▫ not all information needs ACID-requirements
▫ websites e.g. need 'scalability' and flexibility
• Examples :
▫ MongoDB (Drupal 5 x faster than with MySQL)
▫ CouchDB
▫ Cassandra (Apache)
▫ Berkeley DB (Oracle, used in JISIS)
▫ Oracle NoSQL
▫ Big Data (Google)
• hybrid databases, e.g. Virtuoso
New trends in databases (2)
• 'schemaless' database structures : no fixed
predefined 'fields'
▫ each record has its own structure with structural
ID
• 'fingerprints' : a structural or content-based
'summary' of the document is stored in the
record and indexed for faster retrieval
▫ e.g. the old 'ISO2709' standard is actually an
example of this
New trends in databases (3)
• 'Triple'-stores to store the 'Semantic Web'
• all information is expressed as triples :
▫ subject
▫ predicate/property
▫ object
▫ e.g. X is author of Y, A is friend of B etc.
• such triples are stored mostly in RDF
▫ use references to authority elements (e.g. URI, author ID's,
thesauri) instead of literal values
▫ library catalogs : publication URI's, RDA/FRBR based
cataloging to add more relations in the catalog
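The triple idea can be sketched in a few lines of Python. This is only an illustration, not how an RDF store is implemented: the identifiers (`person:X`, `isAuthorOf`, ...) and the helper `match` are made up for the example, standing in for the URIs a real store would use.

```python
# Each fact is one (subject, predicate, object) tuple.
# All identifiers below are hypothetical placeholders for real URIs.
triples = [
    ("person:X", "isAuthorOf", "book:Y"),
    ("person:A", "isFriendOf", "person:B"),
    ("person:X", "isFriendOf", "person:A"),
]

def match(s=None, p=None, o=None):
    """Return all triples that agree with the given parts; None means 'any'."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Everything known about X, and everybody who wrote something.
about_x = match(s="person:X")
authors = [t[0] for t in match(p="isAuthorOf")]
```

Because every statement has the same three-part shape, new kinds of relations can be added without changing any structure.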
Documentary vs. relational
• In computer science, ‘databases’ are mostly
assumed to be ‘relational’
• That model fits administrative data well,
such as addresses, staff records etc.
• Libraries, and also e.g. web-sites, however use
different, less-structured data, where in fact each
record carries its own structure : ‘documents’
The relational model (SQL)
• Relational databases : all data go into tables
• Tables are ‘matrices’ with rows and columns
▫ Each row is a ‘case’ (or record)
▫ Columns are fields
▫ Because of the matrix-structure all records must
have the same length, all fields have fixed length,
defined beforehand
• By splitting data into several tables – and
relating or linking them – some flexibility is
introduced
A single matrix
ID  Title    Author1   Author2   Keywords
1   Title 1  Author_a  Author_b  Food; security;
2   Title 2  Author_b  Author_c  Food; health;
3   Title 3  Author_a            History; economy; food
4   Title 4  Author_a  Author_d  Food; agriculture
Problems :
•How to cope with an undefined number of authors ?
•What is the maximum length of a title ?
•How can one search for one single keyword ?
Solution : create a table with authors/keywords and an
intermediary table to link authors to titles.
The relational model
Titles :
ID  Title    Editor
1   Title_1  Editor_1
2   Title_2  Editor_2

Intermediary (link) table :
Title_ID  Author_ID
1         a1
1         a2
2         a2
2         a3
3         a1
4         a1
4         a4

Authors :
ID  Name
a1  Author_a
a2  Author_b
a3  Author_c
a4  Author_d
Solution : linked tables or ‘relations’
When sorted on title_ID, all authors of
a title are listed until next title
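As a minimal sketch of the linked tables above, the following uses Python's built-in `sqlite3` module with an in-memory database. The table and column names are chosen for the example, the Editor column is left out, and the titles `Title_3` and `Title_4` are filled in from the single-matrix slide.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE titles  (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE authors (id TEXT PRIMARY KEY, name TEXT);
-- Intermediary table: one row per (title, author) link.
CREATE TABLE title_author (
    title_id  INTEGER REFERENCES titles(id),
    author_id TEXT    REFERENCES authors(id)
);
""")
con.executemany("INSERT INTO titles VALUES (?, ?)",
                [(1, "Title_1"), (2, "Title_2"), (3, "Title_3"), (4, "Title_4")])
con.executemany("INSERT INTO authors VALUES (?, ?)",
                [("a1", "Author_a"), ("a2", "Author_b"),
                 ("a3", "Author_c"), ("a4", "Author_d")])
con.executemany("INSERT INTO title_author VALUES (?, ?)",
                [(1, "a1"), (1, "a2"), (2, "a2"), (2, "a3"),
                 (3, "a1"), (4, "a1"), (4, "a4")])

# Sorted on title_id, all authors of a title are listed until the next title.
rows = con.execute("""
    SELECT t.title, a.name
    FROM title_author ta
    JOIN titles  t ON t.id = ta.title_id
    JOIN authors a ON a.id = ta.author_id
    ORDER BY ta.title_id, a.id
""").fetchall()

def authors_of(title_id):
    """Names of all authors linked to one title, any number of them."""
    cur = con.execute(
        "SELECT a.name FROM title_author ta JOIN authors a ON a.id = ta.author_id "
        "WHERE ta.title_id = ? ORDER BY a.id", (title_id,))
    return [name for (name,) in cur]
```

A title can now have any number of authors, but reading one complete ‘record’ needs a join over three tables.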
Relational model : (dis)advantages
•Advantages : ACID
• After ‘normalizing’ the data into several tables, each piece of
data is kept only once, keeping consistency and
avoiding redundancy
• E.g. if an author changes gender, it has to be changed
only in that single record
• Data can be updated in place, in the storage space
they already occupy
•Disadvantages :
• Empty fields take space : a lot of space is ‘lost’
• One ‘record’ is split over many tables; HD-heads have to
read several blocks from different parts/sectors of the
HD
• Every change in one table causes index-pointers to be
updated in all related tables as well
• No flexibility : the architecture has to be well
planned/designed ahead
The no-SQL model
ID  Document_value
1   Text_string
2   XML_string
3   BLOB
4   ISO2709
•Key-value pairs : each row has an identifier and the document is stored in the
value-part
•The value-part can be any structure, e.g. an XML-file, a BLOB...
•This is better suited for ‘documents’ in a database : each record is a document
with its own structure, no fixed lengths or fields
•Large websites can be stored using this model
e.g. in Drupal 7 : MongoDB outperforms MySQL by 5 times
•Examples :
BigTable (Google), Cassandra, MongoDB, CouchDB, BerkeleyDB
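A key-value store can be imitated with a plain Python dictionary, which is enough to show the principle. This is a sketch only: the sample values and the helper `authors_of` are invented for the illustration, and a real store adds persistence, indexing and distribution.

```python
import xml.etree.ElementTree as ET

# key -> value; the store does not care what is inside the value.
store = {
    1: b"Plain text of a first document",
    2: b"<doc><title>Title 2</title>"
       b"<author>Author_b</author><author>Author_c</author></doc>",
    3: bytes([0x89, 0x50, 0x4E, 0x47]),  # opaque binary data (a BLOB)
}

def authors_of(key):
    """Interpret the value as XML if possible; the structure lives in the document."""
    try:
        root = ET.fromstring(store[key])
    except ET.ParseError:
        return []  # not XML: this document simply has no <author> elements
    return [a.text for a in root.findall("author")]
```

Each value carries its own structure, so the number of authors is not limited by any table design. The price is that the application, not the database, has to interpret each document.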
ISIS as a No-SQL database
•ISIS (1975!!) used this model long before it was officially named as such
(‘avant la lettre’)
•The ‘value’ is an ISO-2709 record with header, directory and field-values
• header : numerical descriptions, e.g. total length of record
• directory : ID, starting position and length of each field
• fields : the values themselves
•Semi-relational : with the ‘REF’-function ISIS can combine data from
different databases at ‘run-time’ (meaning : only when needed, not by design)
•High flexibility : fields can be present 0-x times, the directory will tell the
software; fields can be of variable length within the given max. record-length
•Records of any structure can be merged into one single database or arranged
into different databases linked with the REF-function
•If an author changes gender : the original (then correct) gender is kept in that
record, but at the cost of redundancy (made up for by more efficient storage)
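The ‘run-time’ linking can be pictured as follows. This Python sketch is not the ISIS Formatting Language and does not reproduce the actual REF syntax; the two dictionaries, the field names and the helper `ref` are hypothetical and only show the idea of looking up another database while a record is being displayed.

```python
# Two independent "databases", each keyed by record number (MFN).
authors_db = {
    1: {"name": "Author_a"},
    2: {"name": "Author_b"},
}
titles_db = {
    1: {"title": "Title 1", "author_mfn": [1, 2]},  # repeatable field
    2: {"title": "Title 2", "author_mfn": [2]},
    3: {"title": "Title 3"},                        # field absent: 0 occurrences
}

def ref(db, mfn, field, default=""):
    """Fetch one field from a record of another database, only when asked for."""
    record = db.get(mfn)
    return default if record is None else record.get(field, default)

def format_title(mfn):
    """Display a title; the link to the authors database is resolved here, at run-time."""
    record = titles_db[mfn]
    names = [ref(authors_db, a, "name", "?") for a in record.get("author_mfn", [])]
    if not names:
        return record["title"]
    return record["title"] + " / " + "; ".join(names)
```

Nothing in the storage of `titles_db` declares the relation; it exists only in the display format, which is what ‘not by design’ refers to.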
The ISO2709 record as a structural fingerprint
• example record :
0084600000000027700045000010031000000040004000310230009000351200114000440030002001580050002001601000011001621000011001731090033001841210067002171220017002841230015003016000005
00316220013000321200001700451240004200468250001000510324002800520
332001000548343000500558350000500563#ABT ASSOCIATES INC./AGRICULTUR#AMS#19951205
#Conducting Pan-European research: a preliminary evaluation of a new methodology
for European aquaculture research#B#K#Shaw, S.A.#Bailly, D.#^aUniv. Strathclyde
^bGlasgow^cUK#3. Annu. Conf. of the European Association of Fisheries Economists
#Dublin (Ireland)#10-12 Apr 1991#^aen#Proceedings of the third Annual Conference
of the European Association of Fisheries Economists, Dublin, Ireland, 10-12 Apr
il 1991#Hillis, J.P.^ed.#^aDublin (Ireland)^bThe Stationery Office#^p163-175#Ir.
Fish. Invest. [B. Mar.]#0578-7467#1994#^i42#~
• Advantages :
▫ By only reading the header/directory of the record, the whole
structure of the ‘document’ is known and the parser does not need to
parse the document itself
▫ The header/directory can be created by the software at the time of
creating the record when there is plenty of time
• Disadvantage :
▫ According to classic ISO-2709 only 5 positions are provided for the total
length, meaning the max. record length is 99,999 characters (about 100 KB)
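How the header and directory describe the record can be sketched in Python. The layout below follows what can be read off the example record: a 24-character header with the total length in positions 0-4 and the start of the data in positions 12-16, then 12-character directory entries (3 for the tag, 4 for the length, 5 for the starting position), with `#` as terminator. The function names are made up, and using a single `#` as record terminator is an assumption of this sketch.

```python
LEADER_LEN = 24
FIELD_END = "#"  # terminator as printed in the example record

def build_record(fields, len_len=4, start_len=5):
    """Build a record from (tag, value) pairs: header + directory + field values."""
    directory, data = "", ""
    for tag, value in fields:
        chunk = value + FIELD_END
        # directory entry: tag, length of field, starting position within the data
        directory += f"{tag:03d}{len(chunk):0{len_len}d}{len(data):0{start_len}d}"
        data += chunk
    base = LEADER_LEN + len(directory) + 1      # where the field values start
    total = base + len(data) + 1                # assumed: one closing terminator
    leader = (f"{total:05d}" + "0" * 7 + f"{base:05d}" + "000"
              + f"{len_len}{start_len}" + "00")
    return leader + directory + FIELD_END + data + FIELD_END

def read_directory(record):
    """Learn the whole structure without touching the field values."""
    base = int(record[12:17])
    len_len, start_len = int(record[20]), int(record[21])
    entry_len = 3 + len_len + start_len
    entries = []
    for i in range(LEADER_LEN, base - 1, entry_len):
        entry = record[i:i + entry_len]
        entries.append((entry[:3], int(entry[3:3 + len_len]), int(entry[3 + len_len:])))
    return base, entries

def read_field(record, base, length, start):
    """Jump straight to one field value (without its terminator)."""
    return record[base + start: base + start + length - 1]

# The first three fields of the example record.
record = build_record([(1, "ABT ASSOCIATES INC./AGRICULTUR"), (4, "AMS"), (23, "19951205")])
base, entries = read_directory(record)
```

The three directory entries produced here (`001003100000`, `004000400031`, `023000900035`) are the same as the first three in the example record, so any field can be fetched by position without scanning the text.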
The ISIS database model
• All ISO-2709 records (identified by their MFN,
Master File Number) are stored in the MST (master file)
• A 1st order index (XRF, cross-reference file) with fixed
record-length stores the MFNs and their starting position
• A B-tree index creates an ‘inverted file’ (2nd
order index) to keep full ‘addresses’ (MFN, Field,
Occurrence, Position) of extracted search-keys
• A powerful ‘parser’ language (PFT) allows
detailed definition of values to be extracted
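The interplay of the three files can be sketched with a toy class in Python. This is a simplified model under stated assumptions, not the real file formats: records are stored as tab/newline-separated text instead of ISO-2709, the inverted file is a dictionary instead of a B-tree, and every word of every field is indexed, whereas in ISIS a format decides what is extracted. The class name `TinyIsis` is invented.

```python
from collections import defaultdict

class TinyIsis:
    def __init__(self):
        self.mst = bytearray()             # all records, one after the other
        self.xrf = []                      # xrf[mfn - 1] = (start, length) in the MST
        self.inverted = defaultdict(list)  # key -> postings (MFN, field, occ, pos)

    def add(self, fields):
        """Store a record given as (tag, value) pairs; repeated tags are allowed."""
        mfn = len(self.xrf) + 1
        # Assumes values contain no tab or newline characters.
        blob = "\n".join(f"{tag}\t{value}" for tag, value in fields).encode("utf-8")
        self.xrf.append((len(self.mst), len(blob)))
        self.mst += blob
        occurrence = defaultdict(int)
        for tag, value in fields:
            occurrence[tag] += 1
            for pos, word in enumerate(value.lower().split(), start=1):
                self.inverted[word].append((mfn, tag, occurrence[tag], pos))
        return mfn

    def read(self, mfn):
        """One XRF lookup gives the exact location of the record in the MST."""
        start, length = self.xrf[mfn - 1]
        text = bytes(self.mst[start:start + length]).decode("utf-8")
        return [(int(t), v) for t, v in (line.split("\t", 1) for line in text.split("\n"))]

    def search(self, word):
        """Return the full 'addresses' of a search key."""
        return self.inverted.get(word.lower(), [])

db = TinyIsis()
db.add([(24, "Food security"), (70, "Author_a"), (70, "Author_b")])
db.add([(24, "Food and health"), (70, "Author_b")])
hits = db.search("author_b")
```

A search touches only the inverted file; the postings then lead via the XRF directly to the matching records in the MST.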
The ISIS database model (2)
• compare to :
▫ file-systems : files are opened by checking their
exact location in the file-system index
▫ memory management : all values of a program
are stored in memory and called by their exact
location in memory
▫ when XRF fits in RAM : very fast
• but : in addition the ISO2709 header defines the
document structure and speeds up the 'parsing'
Conclusion
• (CDS/)ISIS as a database-model is indeed very
old, yet still quite modern, as its approach is followed by
many new databases
• built-in flexibility in principle allows system managers,
rather than programmers, to create
any database structure (library catalogs, digital
libraries, museums, archives, factual data...)
• the ABCD system as an example : much more
than just library automation
THANK YOU
• Questions ?
• Remarks ?
• demo : ABCD / JISIS
• practical exercises