Fuzzy Data Base

Download Report

Transcript Fuzzy Data Base

Fuzzy Database &
Information Retrieval
Similarity relation defined for the domain opinion
HF
F
SF
SN
N
HN
 HF
 1
 0.8
 0.6
 0.2
 0
 0
F SF
0.8 0.6
1 0.8
0.8 1
0.6 0.8
0.2 0.6
0 0.2
SN
0.2
0.6
0.8
1
0.8
0.6
N
0
0.2
0.6
0.8
1
0.8
HN 
0 
0 
0.2 
0.6 
0.8 
1 
Query:
which sociologists are in considerable agreement with Kass
concerning policy Y?
Fuzzy Relational Data Base: Buckles, Petry
(1) Elements of the tuples contained in the relations may be
subsets of the domain universal set.
(2) A similarity relation is defined on each domain universal set.
Fuzzy Data Base
1. (Project (select Assessment where Name = Kass and Option =
Y) over Opinion) giving R1
Relation : R1
Option
favorable
Retrieve the opinion of Kass
concerning option Y
2. (Project (select Expert where Field = Sociologist) over Name)
giving R2
Relation : R2
Name
Osborn
Schreiber
Cohen
Specterman
Select all sociologists from
the table of experts
3. (Project (select (Join R2 and Assessment over Name) where
Opinion = Y) over Name, Opinion) giving R3
Relation : R3
Name
Opinion
Obsorn
Slightly favorable
Schreiber
Favorable
Cohen
Slightly negative
Specterman Highly favorable
List the opinions of the
sociologists
4. (Join R3 and R1 over Opinion) with THRES (Opinion) ≧ 0.75
and THRES (Name) ≧ 0
Name
Opinion
{Obsorn,
Slightly favorable,
Schreiber
favorable, highly
Specterman} favorable}
Information Retrieval
1. IR process can be viewed as a knowledge
communication process, which involves
learning and problem solving strategies.
2. IR system controls the knowledge flow
between documents and the user.
3. An IR system is a computer system that
allows users to retrieve information from a
document connection stored in a data base.
4. Classical IR is built on Boolean algebra
framework.
Fuzzy Information Retrieval
1. Fuzzy relation for the grade of relevance
between index terms and documents:
–
–
–
R: X x Y  [0,1]
Determined subjectively or objectively
Number of occurrences; publication dates,
document types
2. Fuzzy thesaurus
–
T: X x X  [0,1]
Fuzzy Information Retrieval
1. Information retrieval model
B  AT
Bx j   max min[ Axi , T ( xi , x j )]
xi X
D  B R
2. Advantages:
–
–
R and T are more expressive and their
construction is more realistic
Fuzzy inquiry provides greater flexibility
Information retrieval based on
fuzzy associations
1.
2.
3.
4.
5.
6.
7.
Introduction
Three components in information retrieval
Fuzziness in a thesaurus: first component
Fuzziness in retrieval: second component
Fuzziness on output: third component
Classification of output
Conclusion
2. Three components in information retrieval
D = {d1,d2,…,dn} be a finite set of documents for
retrieval
W = {w1,w2,…,wm} denote a set of descriptors
T : D ─>[0,1]w.T(d): a subset of descriptors in W
indexed to the document d.
U(U = T-1).U(w): documents have keyword w.
Information retrieval based on fuzzy associations
q
F
U
r’
P
r
3. Fuzziness in a thesaurus: first component
Three type thesaurus (represented as binary relation)
RT: related terms
NT: narrower terms
BT: broader terms
B(v,w) = N(w,v) R(v,w) = R(w,v)
Method of automatic generation of thesauri:
1. Typical:counting frequencies of simultaneous occurrences of
pairs of keywords in a set of documents.
2. Fuzzy set model:
C = {c1,c2,…,cp} be a finite set of concepts where each ci,
i=1,…p represents a unit of concept
H:W ─>[0,1]p a fuzzy set valued function which maps each
keyword to it’s corresponding concepts as a fuzzy set in C.
h ( w) : w  W is concept of the word w.
R ( v , w) 
h ( v )  h ( w)
h ( v )  h ( w)
N ( v , w) 
h ( v )  h ( w)
h(v )
Even by present computers, it’s difficult to calculate values
of the fuzzy relation above using array in straightforward way,
since the numbers of elements in W and D are very large(103 x
105). Although techniques to handle sparse matrices may be
applied, there is another method for generation R and N based
on manipulation of sequential files. The principle tool for this is
sorting.
(a,b,c) means a record in which field are a,b and c.{(a,b,c)}
means a set of records such as (a,b,c).
Input: a set D of documents, Each document d ∈ D has a
number of keywords in W.A keyword may occur twice or
more in a document. The frequency of occurrence of wi in dk
is denoted by hik.
Output: a set of records {(wi,wj,R(wi,wj)]} for all pairs
R(wi,wj)<>0
Algorithm GFT (generation of a fuzzy thesaurus).
// Find pairs of keywords in every document.//
For all dk∈D do
find all keywords wi∈W and calculate hjk
for all (wi,wj),wi<wj, that are found in dk do
make record (wi,wj, min(hik,hjk))
output (wi,wj,min(hik,hjk) to WORK1
repeat
for all wi that are found in dk do
make record (wi, hjk)
output (wi,hjk) to WORK2
repeat
repeat
//sort WORK1 and WORK2.//
sort WORK1 into increasing order of the key (wi,wj)
sort WORK2 into increasing order of the key wi
//Calculate R.Scan WORK1 and WORK2.//
for all (wi,wj) in WORK1 do
find all record for (wi,wj) in WORK1
and all records for wi, and wj in WORK2
R (wi,wj)←∑k min(hik,hjk)/(∑k hik+ ∑k hjk- ∑kmin(hik,hjk))
output (wi,wj,R(wi,wj)) to an output file
repeat
end-of algorithm GFT
In a foregoing paper an experimental calculation on three
thousand documents and thirty thousand keywords was
carried out using GFT based on sorting shows a reasonable
amount of 800 sec of CPU time.
//record (di,pi)//
//before another record (di,pi) satisfies either di<dj or//
//di = dj, pi > pj//
Take the first record (d1,p1) in work
(D,P)<-(d1,p1)
for all dj in WORK do
//the dj’s are sequentially examined.//
if D <> dj then
output (D,P) to to an output file OUT
(D,P)<-(di,pj)
endif
repeat
output(D,P) to OUT
//OUT contains exactly those records that represent
P=Uf(d,w) define by above//
//Third step: if necessary sort again.//
sort OUT into the decreasing order of the key p and print
OUT
4. Fuzziness in retrieval: second component
For the crisp case a retrieval through a thesaurus
given a keyword w is as follows.
(a) Examine the thesaurus F and find all associated
terms v11,v12,…,v1p.
(b) Find subsets U(v11),U(v12),…,U(v1p).
(c) Establish the retrieved set of documents as the
union of U(v11),U(v12),…,U(v1p): ∪1≦i≦p U(v1i)
Uf(d,w) = 1 iff d∈U(v1) for some v1 such that
F(v1,w) = 1,
0 otherwise.
When the thesaurus F is fuzzy and U is crisp
Uf(d,w) = max v∈W min [U(d,v),F(v,w)].
This equation is valid also for a fuzzy relation U(d,v).
Algorithm FR(Fuzzy Retrieval).
//First step: Find all records.//
for all v such that F(v,w) <> 0 in FT do
for all d∈U(v) do
p(d,v)<-min[U(d,v),F(v,w)]
output record (d,p(d,v)) to a work file WORK
repeat
repeat
//second step: Find values of Uf.//
sort WORK into increasing order of the first key d
and into decreasing order of the second key p
//the above sorting means that in the resulting sequence,a//
End-of FR
5. Fuzziness on output: third component
Fuzzy filter. EX:
(a) Find recent documents that have keyword w.
(b) Find documents that have keywords w and are
relevant to one’s field of interest
r = r’ ∩ g
6. Classification of output
1) Decreasing of membership
2) Divide into layers
7. Conclusion
Problem for further studies
1) Discussion of crisp techniques of advanced
indexing and retrieval using a fuzzy set model,
2) Studies of efficient algorithms for large scale
database. In particular, development of hardware
for information retrieval should be taken into
account.
3) Application of methods in fuzzy information
retrieval to related areas.
個人化服務元件技術之研究
郭耀煌教授
成功大學資訊工程系
成功大學數位生活科技研究中心