Text-Garden Software Suite
Quick Overview
Marko Grobelnik, Dunja Mladenic
Jozef Stefan Institute
Ljubljana, Slovenia
Outline







What is Text-Garden?
How is Text-Garden being built?
Major functionalities
Technical aspects
Future developments
Availability
Text-Garden & SMART
What is Text-Garden?

Text-Garden is a software library and
collection of software tools for solving
large scale tasks dealing with structured,
semi-structured and unstructured data


…in particular, the functionality emphasizes
dealing with text
It can be used in various ways covering
research and applicative scenarios

Being used by several institutions such as BT,
MSR, CMU, …
Some history…



The work started in 1996 as a set of C++ classes for
dealing with text and performing text-learning tasks
(two people working on it)
…until 2002 it developed slowly, following the
academic tasks on our agenda
From 2003 on, Text-Garden became the central software
platform at JSI, used in many research and
applicative projects (~10 people contributing)

…all the solutions and results JSI is working on eventually
become part of the Text-Garden environment
…local JSI development of Text-Garden
Projects
[Diagram: projects SMART (STREP), PASCAL (NoE), SEKT (IP), NEON (IP), … around the JSI team and Text-Garden]
Major functionalities
Functionality blocks
Lexical text processing
(tokenization, stop-words, stemming, n-grams, Wordnet)
Unsupervised learning
(KMeans, Hierarchical-KMeans,
OneClassSVM)
Semi-Supervised learning
(Uncertainty sampling)
Supervised learning
(SVM, Winnow, Perceptron, NBayes)
Dimensionality reduction
(LSI, PCA)
Visualization
(Graph based, Tiling, Density based, …)
Named Entity Extraction
(capitalization based)
Cross Correlation
(KCCA, matching text with other data)
Keyword Extraction
(contrast, centroid, taxonomy based)
Large Taxonomies
(dealing with DMoz, Medline)
Crawling Web and Search Eng.
(for large scale data acquisition)
Scalable Search (inverted index)
Lexical processing



Includes transformation from various formats into bag-of-words
representation
…text/html, many custom formats (Svm-light, Reuters, …)
…will support most text encodings
Lexical processing includes
Tokenization
Stop-words removal
Stemming (Porter stemmer for English; for other languages,
stemmers are learned from lexicons with machine learning)
Frequent N-Gram features (consecutive words co-appearing)
Proximity features (words co-appearing within a window)
Wordnet integration (words co-appearing in synsets or in e.g.
hypernym relations)
Output of this level is a .BOW (Bag-Of-Words) file
…with dictionary and sparse vectors for documents
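The lexical steps listed above can be sketched in a few lines of Java. This is only an illustration, not Text-Garden code: the class name BowSketch, the tiny stop-word list, and the in-memory maps are assumptions made for the example, and stemming is left out. It tokenizes text, removes stop-words, adds bigrams of consecutive remaining words, and builds a dictionary plus one sparse count vector per document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch only; this is not Text-Garden code.
class BowSketch {
    // Tiny hypothetical stop-word list; real lists are much longer.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "is", "to", "and", "in"));

    // Dictionary mapping each feature string to an integer id.
    final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Lower-case the text, split on non-letters, and drop stop-words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Unigrams plus bigrams of consecutive tokens (after stop-word removal).
    static List<String> features(List<String> tokens) {
        List<String> feats = new ArrayList<>(tokens);
        for (int i = 0; i + 1 < tokens.size(); i++) {
            feats.add(tokens.get(i) + "_" + tokens.get(i + 1));
        }
        return feats;
    }

    // Returns the sparse vector of a document: feature id -> count.
    Map<Integer, Integer> addDocument(String text) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String f : features(tokenize(text))) {
            Integer id = dictionary.get(f);
            if (id == null) {
                id = dictionary.size();
                dictionary.put(f, id);
            }
            vector.merge(id, 1, Integer::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        BowSketch bow = new BowSketch();
        System.out.println(bow.addDocument("The scarcity of the economic system"));
        System.out.println(bow.addDocument("economic system economic system"));
        System.out.println(bow.dictionary);
    }
}
```

The dictionary grows as documents are added, so feature ids stay stable across documents and the vectors only store non-zero entries.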
Unsupervised learning

Algorithms:

K-means clustering
…clustering into flat clusters
Hierarchical-K-Means
…creating hierarchy of clusters
One-Class-SVM
…learning from positive class only

…result is .BowPart (BOW Partition) file
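As a rough illustration of flat clustering with K-means, the following Java sketch alternates between assigning points to the nearest centroid and recomputing centroids. It is not the Text-Garden implementation: the class name KMeansSketch, dense double[] vectors, Euclidean distance, and initialization from the first k points are all simplifying assumptions.

```java
import java.util.Arrays;

// Illustrative sketch only; this is not Text-Garden code.
class KMeansSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += (a[j] - b[j]) * (a[j] - b[j]);
        }
        return sum;
    }

    // Returns the cluster index of every point (a flat partition).
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int n = points.length;
        int d = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // naive initialization: first k points
        }
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: move each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[i], centroids[c])
                            < squaredDistance(points[i], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: each centroid becomes the mean of its points.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < d; j++) {
                    sums[assignment[i]][j] += points[i][j];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int j = 0; j < d; j++) {
                        centroids[c][j] = sums[c][j] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {10, 10}, {0, 1}, {10, 11}, {1, 0}};
        System.out.println(Arrays.toString(cluster(points, 2, 100)));
    }
}
```

A hierarchical variant could apply the same routine recursively to each resulting cluster.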
Supervised learning

The following algorithms are implemented:







SVM (two class, regression)
Winnow
Perceptron
NaiveBayes
K-Nearest-Neighbor (KNN)
…
…result is .BowMd (BOW Model) file which is used
for further classification
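To give a feel for the simplest of the listed learners, here is a minimal perceptron over sparse vectors in Java. It is a sketch under stated assumptions, not the Text-Garden code: the class name PerceptronSketch, the Map-based sparse vectors, and the +1/-1 label encoding are chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; this is not Text-Garden code.
class PerceptronSketch {
    final Map<Integer, Double> weights = new HashMap<>();
    double bias = 0.0;

    // Linear score of a sparse vector (feature id -> value).
    double score(Map<Integer, Double> x) {
        double s = bias;
        for (Map.Entry<Integer, Double> e : x.entrySet()) {
            s += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return s;
    }

    // Labels are +1 or -1; weights change only on misclassified examples.
    void train(List<Map<Integer, Double>> docs, int[] labels, int epochs) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            int mistakes = 0;
            for (int i = 0; i < docs.size(); i++) {
                if (labels[i] * score(docs.get(i)) <= 0) {
                    for (Map.Entry<Integer, Double> e : docs.get(i).entrySet()) {
                        weights.merge(e.getKey(), labels[i] * e.getValue(), Double::sum);
                    }
                    bias += labels[i];
                    mistakes++;
                }
            }
            if (mistakes == 0) {
                break; // every training example is classified correctly
            }
        }
    }

    public static void main(String[] args) {
        Map<Integer, Double> positive = new HashMap<>();
        positive.put(0, 1.0);
        Map<Integer, Double> negative = new HashMap<>();
        negative.put(1, 1.0);
        List<Map<Integer, Double>> docs = new ArrayList<>();
        docs.add(positive);
        docs.add(negative);
        PerceptronSketch model = new PerceptronSketch();
        model.train(docs, new int[] {1, -1}, 10);
        System.out.println(model.score(positive) + " " + model.score(negative));
    }
}
```

The trained weights play the role of a model that is later applied to new documents; the sign of the score gives the predicted class.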
Dimensionality reduction


Transform the original space into a low-dimensional one
and project the original data into it
Two classical methods

Latent Semantic Indexing (LSI)



…efficient implementation, working with sparse matrices
Principal Component Analysis (PCA)
…result is .SemSpace (Semantic Space) file which
is used for projecting the data

We can e.g. project original .BOW file into transformed
lower dimensional .BOW file
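The idea of learning a low-dimensional space and then projecting data into it can be illustrated with a one-component PCA. The Java sketch below finds the first principal direction by power iteration on the covariance matrix. It is an assumption-laden toy (class name PcaSketch, dense matrices, a single component, a fixed start vector), not the sparse implementation the slides refer to.

```java
// Illustrative sketch only; this is not Text-Garden code.
class PcaSketch {
    // First principal direction via power iteration on the covariance matrix.
    static double[] firstComponent(double[][] data, int iterations) {
        int n = data.length;
        int d = data[0].length;
        double[] mean = new double[d];
        for (double[] row : data) {
            for (int j = 0; j < d; j++) {
                mean[j] += row[j] / n;
            }
        }
        double[][] cov = new double[d][d];
        for (double[] row : data) {
            for (int a = 0; a < d; a++) {
                for (int b = 0; b < d; b++) {
                    cov[a][b] += (row[a] - mean[a]) * (row[b] - mean[b]) / n;
                }
            }
        }
        double[] v = new double[d];
        for (int j = 0; j < d; j++) {
            v[j] = 1.0 / (j + 1); // arbitrary non-zero start vector
        }
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[d];
            for (int a = 0; a < d; a++) {
                for (int b = 0; b < d; b++) {
                    next[a] += cov[a][b] * v[b];
                }
            }
            double norm = 0.0;
            for (double x : next) {
                norm += x * x;
            }
            norm = Math.sqrt(norm);
            if (norm == 0.0) {
                break; // no variance along the current vector; give up
            }
            for (int j = 0; j < d; j++) {
                v[j] = next[j] / norm;
            }
        }
        return v;
    }

    // Coordinate of an (already centered) vector along the component.
    static double project(double[] x, double[] component) {
        double s = 0.0;
        for (int j = 0; j < x.length; j++) {
            s += x[j] * component[j];
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] data = {{-1, -1}, {0, 0}, {1, 1}};
        double[] pc = firstComponent(data, 50);
        System.out.println(pc[0] + " " + pc[1] + " -> " + project(data[2], pc));
    }
}
```

Further components would be obtained by removing the found direction from the data and repeating; LSI differs mainly in working on the uncentered term-document matrix.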
Named Entity Extraction

Simple and efficient NEE

…it is based on word capitalization




Candidate NEs (words and sequences of words) need to
be capitalized
Heuristic rule: capitalized candidates must appear within
text at least once
…we handle exceptions separately
Works well with some user interaction
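One possible reading of the capitalization heuristic is sketched below in Java. The interpretation is an assumption, not a description of the actual Text-Garden rules: candidates are maximal runs of capitalized words, and a candidate is kept only if it occurs at least once somewhere other than the start of a sentence. The class name NeeSketch and the naive sentence splitting are also illustrative; punctuation inside a sentence does not break a run here, and exceptions are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; this is not Text-Garden code.
class NeeSketch {
    // Strip everything except letters and apostrophes from a raw word.
    static String clean(String word) {
        return word.replaceAll("[^\\p{L}']", "");
    }

    static boolean isCapitalized(String word) {
        return !word.isEmpty() && Character.isUpperCase(word.charAt(0));
    }

    static List<String> extract(String text) {
        // Candidate -> true if it was seen at least once inside a sentence.
        Map<String, Boolean> seenInside = new LinkedHashMap<>();
        for (String sentence : text.split("[.!?]+")) {
            String[] words = sentence.trim().split("\\s+");
            int i = 0;
            while (i < words.length) {
                if (!isCapitalized(clean(words[i]))) {
                    i++;
                    continue;
                }
                int start = i;
                StringBuilder candidate = new StringBuilder();
                // Collect a maximal run of consecutive capitalized words.
                while (i < words.length && isCapitalized(clean(words[i]))) {
                    if (candidate.length() > 0) {
                        candidate.append(' ');
                    }
                    candidate.append(clean(words[i]));
                    i++;
                }
                seenInside.merge(candidate.toString(), start > 0, Boolean::logicalOr);
            }
        }
        List<String> entities = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : seenInside.entrySet()) {
            if (e.getValue()) {
                entities.add(e.getKey());
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "Oscar Wilde was an Irish playwright. "
            + "One of the celebrities of Victorian London was Oscar Wilde."));
    }
}
```

In this reading, a word such as "One" that is capitalized only because it starts a sentence is dropped, which is the kind of case where some user interaction would still help.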
Crawler & Search Engine

Crawler


…Slovenian internet archive being crawled by
Text-Garden Crawler
Highly scalable indexing & search of text
documents

E.g. indexing of 10M documents in several hours
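The inverted index behind this kind of search can be sketched as a map from each term to a sorted list of document ids. The Java below is a toy in-memory version with assumed names (InvertedIndexSketch, add, search); it says nothing about how Text-Garden stores or scales its index. Queries are treated as AND queries and answered by intersecting postings lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustrative sketch only; this is not Text-Garden code.
class InvertedIndexSketch {
    // Term -> ids of documents containing it, in increasing order.
    final Map<String, List<Integer>> postings = new HashMap<>();
    int documentCount = 0;

    // Indexes a document and returns its id.
    int add(String text) {
        int docId = documentCount++;
        // TreeSet removes duplicate terms within the document.
        for (String term : new TreeSet<>(Arrays.asList(text.toLowerCase().split("[^a-z0-9]+")))) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns ids of documents containing every query term.
    List<Integer> search(String query) {
        List<Integer> result = null;
        for (String term : query.toLowerCase().split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.getOrDefault(term, Collections.emptyList());
            result = (result == null) ? new ArrayList<>(list) : intersect(result, list);
        }
        return result == null ? new ArrayList<>() : result;
    }

    // Merge-style intersection of two sorted id lists.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i);
            int y = b.get(j);
            if (x == y) {
                out.add(x);
                i++;
                j++;
            } else if (x < y) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Slovenia basketball news");
        index.add("Slovenia web archive");
        index.add("basketball in Slovenia");
        System.out.println(index.search("slovenia basketball"));
    }
}
```

Because document ids are assigned in increasing order, each postings list stays sorted and the intersection runs in time linear in the list lengths.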
Support for selected external sources

Text-Garden has special support for the following
databases and services










Google Search (Web/News/Scholar)
DMoz/Open Directory Project
Medline
WordNet
Yahoo! Finance
CIA World-Fact-Book
Cordis project database
Cyc (OpenCyc/ResearchCyc)
Reuters datasets (old, new), ACM TechNews, …
In preparation


Wikipedia, EuroVoc, AgroVoc (FAO)
MSN Search, Yahoo Search
Technical aspects

Text-Garden is almost entirely written in
portable C++



…it compiles under Windows (Microsoft Visual
C++, Borland C++) and Unix/Linux (GNU C++)
…it runs on 32-bit and 64-bit platforms
…it consists of ~200,000 relatively compact lines
of code
How to use Text-Garden functionality?

Text-Garden functionality can be accessed in a
number of ways:

As plain C++ classes
…complete functionality
As DLL library of ~250 functions
…simplified extract of major functionality
As command line utilities
…~60 command line utilities that can be connected in a pipeline
Through GUI tools
…(e.g. DocAtlas, OntoGen, …)
Through interfaces to several platforms
…(Java, Matlab, …) – next slide
Multiplatform Text-Garden

Text-Garden has the following interfaces with the same API:

C/C++ – through simplified DLL & native C++
Java – through JNI
.NET – e.g. accessible through C#, VB, …
Matlab – through standard Matlab interface
Python – through standard Python interface
Mathematica, Prolog, R – in preparation

API has ~40 classes and ~250 functions
…interfaces to all of the above platforms are generated automatically
from the master Text-Garden header file
…next slides include some examples in Matlab and Java
Text Parsing to TFIDF – Matlab
% Create a new bag-of-words document base.
BowDocBsId = NewBowDocBs('en523', 'porter');
% Add two documents; long strings are concatenated with [...] and line continuations.
DId(1) = AddBowDocBs_HtmlDocStr(BowDocBsId, 'Economics', ['There are several basic ' ...
    'and incomplete questions that must be answered in order to resolve the problems of ' ...
    'economics satisfactorily. The scarcity problem, for example, requires answers to basic ' ...
    'questions, such as: what to produce, how to produce it, and who gets what is produced. ' ...
    'An economic system is a way of answering these basic questions. Different economic ' ...
    'systems answer them differently.'], '', 1);
DId(2) = AddBowDocBs_HtmlDocStr(BowDocBsId, 'Oscar Wilde', ['Oscar Fingal O''Flahertie ' ...
    'Wills Wilde (October 16, 1854 -- November 30, 1900) was an Irish playwright, novelist, ' ...
    'poet, short story writer and Freemason. One of the most successful playwrights of late ' ...
    'Victorian London, and one of the greatest celebrities of his day, known for his barbed ' ...
    'and clever wit, he suffered a dramatic downfall and was imprisoned after being ' ...
    'convicted in a famous trial for gross indecency (homosexual acts).'], '', 1);
% Compute the weighted document vectors.
BowDocWgtBsId = GenBowDocWgtBs(BowDocBsId);
% Print every word of the first document together with its weight.
WIds = GetBowDocWgtBs_DocWIds(BowDocWgtBsId, DId(1));
for WIdN = 0:(WIds-1)
    WId = GetBowDocWgtBs_DocWId(BowDocWgtBsId, DId(1), WIdN);
    WordStr = GetBowDocBs_WordStr(BowDocBsId, WId);
    WordWgt = GetBowWgtDocBs_DocWWgt(BowDocWgtBsId, DId(1), WId);
    sprintf('%s:%.5f', WordStr, WordWgt)
end
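For readers unfamiliar with TFIDF, the weighting itself can be written down in a few lines of Java. The exact formula used by Text-Garden is not stated in these slides, so the sketch assumes one common variant: term frequency times log(N / document frequency), followed by L2 normalization of each document vector. The class name TfIdfSketch and the Map-based inputs are illustrative only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; this is not Text-Garden code.
class TfIdfSketch {
    // Input: per-document term counts. Output: per-document normalized TFIDF weights.
    static List<Map<String, Double>> weigh(List<Map<String, Integer>> docs) {
        // Document frequency: in how many documents each term occurs.
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String term : doc.keySet()) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
        int n = docs.size();
        List<Map<String, Double>> result = new ArrayList<>();
        for (Map<String, Integer> doc : docs) {
            Map<String, Double> weights = new HashMap<>();
            double squaredNorm = 0.0;
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                double idf = Math.log((double) n / documentFrequency.get(e.getKey()));
                double w = e.getValue() * idf;
                weights.put(e.getKey(), w);
                squaredNorm += w * w;
            }
            double norm = Math.sqrt(squaredNorm);
            if (norm > 0.0) {
                // Scale the vector to unit length.
                for (Map.Entry<String, Double> e : weights.entrySet()) {
                    e.setValue(e.getValue() / norm);
                }
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> economics = new HashMap<>();
        economics.put("economic", 2);
        economics.put("system", 1);
        Map<String, Integer> wilde = new HashMap<>();
        wilde.put("system", 1);
        wilde.put("playwright", 3);
        List<Map<String, Integer>> docs = new ArrayList<>();
        docs.add(economics);
        docs.add(wilde);
        System.out.println(weigh(docs));
    }
}
```

A term that occurs in every document gets weight zero under this variant, which is why words shared by all documents contribute nothing to similarity.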
SVM Classification – Java
import si.ijs.jtextgarden.*;
public class SVM
{
    public static void main(String[] args)
    {
        System.out.println("Loading JTextGardenLib...");
        JTextGardenLib tg = new JTextGardenLib();
        System.out.println("Loading bow...");
        int BowDocBsId = tg.LoadBowDocBs("./data/topic50k.bow");
        tg.SaveBowDocBsStat(BowDocBsId, "./res/BowDocBsStat.txt");
        System.out.println("Training linear SVM binary classifier...");
        int ECatId = tg.GetBowDocBs_CId(BowDocBsId, "ECAT");
        int BinSVMBowMdId = tg.GetBinSVMBowMd(BowDocBsId, ECatId);
        System.out.println("Testing model...");
        String Doc1 = "There are several basic and incomplete questions that " +
            "must be answered in order to resolve the problems of economics " +
            "satisfactorily. The scarcity problem, for example, requires answers " +
            "to basic questions, such as: what to produce, how to produce it, " +
            "and who gets what is produced. An economic system is a way of " +
            "answering these basic questions. Different economic systems answer " +
            "them differently.";
        double CfyRes1 = tg.GetBowMdCfyFromHtml(BinSVMBowMdId, BowDocBsId, Doc1);
        System.out.println("CfyRes1 = " + CfyRes1);
        String Doc2 = "Oscar Fingal O'Flahertie Wills Wilde (October 16, " +
            "1854 -- November 30, 1900) was an Irish playwright, novelist, poet, short " +
            "story writer and Freemason. One of the most successful playwrights of late " +
            "Victorian London, and one of the greatest celebrities of his day, known " +
            "for his barbed and clever wit, he suffered a dramatic downfall and was " +
            "imprisoned after being convicted in a famous trial for gross indecency " +
            "(homosexual acts).";
        double CfyRes2 = tg.GetBowMdCfyFromHtml(BinSVMBowMdId, BowDocBsId, Doc2);
        System.out.println("CfyRes2 = " + CfyRes2);
    }
}
Google Web&News Querying – Java
import si.ijs.jtextgarden.*;
public class Google
{
    public static void main(String[] args)
    {
        System.out.println("Loading JTextGardenLib...");
        JTextGardenLib tg = new JTextGardenLib();
        System.out.println("Web Search...");
        int WebRSetId = tg.GoogleWebSearch("slovenia", 50);
        System.out.println("Number of hits: " + tg.GetRSet_Hits(WebRSetId));
        for (int HitN = 0; HitN < 10; HitN++) {
            System.out.println("Hit " + (HitN+1) + ": " + tg.GetRSet_HitTitleStr(WebRSetId, HitN));
        }
        System.out.println("News Search...");
        int NewsRSetId = tg.GoogleNewsSearch("slovenia basketball");
        System.out.println("Number of hits: " + tg.GetRSet_Hits(NewsRSetId));
        for (int HitN = 0; HitN < 10; HitN++) {
            System.out.println("Hit " + (HitN+1) + ": " + tg.GetRSet_HitUrlStr(NewsRSetId, HitN));
        }
    }
}
Active Learning – Java
import si.ijs.jtextgarden.*;
import java.io.*;
public class ActiveLearning
{
    public static void main(String[] args) throws IOException
    {
        System.out.println("Loading JTextGardenLib...");
        JTextGardenLib tg = new JTextGardenLib();
        System.out.println("Loading bow...");
        int BowDocBsId = tg.LoadBowDocBs("./data/CiaWFB.partly.bow");
        String CatNm = "Europe";
        int CatId = tg.GetBowDocBs_CId(BowDocBsId, CatNm);
        int BowALId = tg.NewBowAL(BowDocBsId, CatId);
        System.out.println("Starting Active Learning loop...");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        while (true) {
            if (!tg.GetBowAL_QueryDIdV(BowALId)) { break; }
            double QueryMnDist = tg.GetBowAL_QueryDist(BowALId, 0);
            int QueryDId = tg.GetBowAL_QueryDId(BowALId, 0);
            String DocNm = tg.GetBowDocBs_DocNm(BowDocBsId, QueryDId);
            System.out.println("Does the following document belong to the \'" + CatNm + "\' category?");
            System.out.println(DocNm + " [" + QueryMnDist + "]");
            System.out.println("1-yes, 2-no, 3-stop");
            int UserResponse = Integer.parseInt(br.readLine());
            if (UserResponse == 3) { break; }
            tg.MarkBowAL_QueryDId(BowALId, QueryDId, (UserResponse==1));
        }
        System.out.println("Finishing Active Learning ...");
        System.out.println("Mark the rest of the documents? (1 - yes, 2 - no)");
        int UserResponse = Integer.parseInt(br.readLine());
        if (UserResponse == 1) {
            tg.MarkBowAL_UnlabeledPosDocs(BowALId);
            int Docs = tg.GetBowDocBs_Docs(BowDocBsId);
            for (int DId = 0; DId < Docs; DId++) {
                String DocNm = tg.GetBowDocBs_DocNm(BowDocBsId, DId);
                int DocCIds = tg.GetBowDocBs_DocCIds(BowDocBsId, DId);
                for (int DocCIdN=0; DocCIdN < DocCIds; DocCIdN++) {
                    int DocCId = tg.GetBowDocBs_DocCId(BowDocBsId, DId, DocCIdN);
                    String DocCatNm = tg.GetBowDocBs_CatNm(BowDocBsId, DocCId);
                    if (DocCatNm.equals(CatNm)) { System.out.println(DocNm + " [" + CatNm + "]"); }
                }
            }
        }
    }
}
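The selection step of uncertainty sampling, which an active learning loop like the one above relies on, can be illustrated independently of the library. The sketch assumes that each document has a signed classifier score whose absolute value measures distance from the decision boundary, and that the next query is the unlabeled document closest to that boundary. The class and method names are hypothetical, and how Text-Garden actually ranks its queries is not described here.

```java
// Illustrative sketch only; this is not Text-Garden code.
class UncertaintySamplingSketch {
    // Returns the index of the unlabeled document with the smallest |score|,
    // or -1 when every document is already labeled.
    static int mostUncertain(double[] scores, boolean[] labeled) {
        int best = -1;
        for (int i = 0; i < scores.length; i++) {
            if (labeled[i]) {
                continue;
            }
            if (best < 0 || Math.abs(scores[i]) < Math.abs(scores[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {1.5, -0.2, 0.1, -3.0};
        boolean[] labeled = {false, false, true, false};
        System.out.println(mostUncertain(scores, labeled));
    }
}
```

After the user labels the selected document, the classifier is retrained and the scores are recomputed before the next query is chosen.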
Future developments

A text-mining textbook is being prepared around
Text-Garden


…Text-Garden will serve as the software covering most of the
topics within the book
Text-Garden is being extended with other sets of
functionalities




…Graph-Garden – dealing with graphs and networks
(collaboration with CMU)
…Media-Garden – dealing with images and videos
…Semantic-Garden – dealing with Semantic-Web issues
…Stream-Garden – dealing with streams of data
Availability


Text-Garden is released under the LGPL license
It is available from www.textmining.net

…a new complete release will be uploaded soon
…some images from Text-Garden
Document-Atlas
Content-Land
SEKTbar
Semantic-Graphs
Contexter
Onto-Genesis