Text and Data Mining File

Text and Data Mining
Linda Pikula
NOAA
[email protected]
OceanTeacher Global Academy, Digital
Asset Management and Preservation
30 September – 4 October 2013
KMFRI
Mombasa, Kenya
TDM=Text Mining
“automated processing of large amounts of
structured digital textual content for purposes
of information retrieval, extraction,
interpretation and analysis”
Bernie Reilly, Center for Research Libraries (CRL)
TDM=Data Mining
• Overview
Generally, data mining (sometimes called data or knowledge
discovery) is the process of analyzing data from different
perspectives and summarizing it into useful information: information that can be used to increase revenue, cut
costs, or both.
Data mining software is one of a number of analytical tools
for analyzing data. It allows users to analyze data from
many different dimensions or angles, categorize it, and
summarize the relationships identified. Technically, data
mining is the process of finding correlations or patterns
among dozens of fields in large relational databases.
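The idea of finding correlations among database fields can be illustrated with a minimal sketch. The records and field names below (`visits`, `items`, `returns`) are invented for illustration and are not from the presentation; the code simply computes the Pearson correlation for every pair of fields and lists the strongest relationships first.

```python
from itertools import combinations
from math import sqrt

# Hypothetical records: each row is one customer, each key one database field.
records = [
    {"visits": 1, "items": 2, "returns": 3},
    {"visits": 2, "items": 4, "returns": 2},
    {"visits": 3, "items": 6, "returns": 2},
    {"visits": 4, "items": 8, "returns": 1},
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def field_correlations(rows):
    """Return {(field_a, field_b): r} for every pair of fields."""
    fields = sorted(rows[0])
    columns = {f: [row[f] for row in rows] for f in fields}
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(fields, 2)
    }

correlations = field_correlations(records)
# Print the field pairs, strongest relationship (positive or negative) first.
for pair, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(pair, round(r, 2))
```

Real data mining software works on far more fields and rows and uses many techniques beyond correlation; this only shows the basic pattern of comparing fields pairwise.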
Another Definition
“automated tools, techniques or technology to
process large volumes of digital content that is
often not well structured…to identify and
select relevant information; to extract
information from the content, to identify
relationships within/between/across
documents and incidents or events for meta-analysis”
Eefke Smit
Another Definition
Text Mining discovers themes, patterns, emerging issues and
insights buried in document collections. By automatically
reading text and delivering algorithms for rigorous, advanced
analyses, the solution makes it possible to grasp future trends
and act on new opportunities more precisely and with less
risk. It can include advanced linguistic capabilities within the
core data mining solution
SAS definition
TDM
Business uses vary from scholarly uses
Class Discussion
How might business use data mining?
Health sciences?
Scholarly uses?
Reasons for TDM
To enrich content
Systematic review of literature
Discovery
Computational linguistics research
Steps in TDM
1. Researchers must be able to process
large amounts of content by automated means
2. Researchers must identify questions
to be asked
3. Must be able to find the right sources
to be mined
4. Must be able to access these sources
5. Must be able to download the results
6. Must be able to analyze and interpret
the results
Hurdles to overcome
1. Software required?
2. Construct a proper query
3. Obtain permission to access if the
institution does not subscribe: licensing
problems
4. Varying formats: no standard
formats for storage
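As a rough illustration of the steps listed above, the sketch below runs a query against a tiny in-memory corpus, keeps the matching sources, and counts terms in them. The corpus, the stopword list and the function names are all hypothetical; in practice the content would come from licensed sources in varying formats, which is where the hurdles arise.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for downloaded journal content.
corpus = {
    "article-1": "Spinal cord injury outcomes improve with early rehabilitation.",
    "article-2": "Coral reef monitoring uses satellite data.",
    "article-3": "Rehabilitation after spinal cord injury: a systematic review.",
}

# Illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "after", "with", "the", "uses"}

def tokenize(text):
    """Lower-case a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def select_sources(docs, query_terms):
    """Steps 2-3: keep only documents containing every query term."""
    wanted = set(query_terms)
    return {doc_id: text for doc_id, text in docs.items()
            if wanted <= set(tokenize(text))}

def analyze(docs):
    """Step 6: count how often each non-stopword term occurs overall."""
    counts = Counter()
    for text in docs.values():
        counts.update(t for t in tokenize(text) if t not in STOPWORDS)
    return counts

selected = select_sources(corpus, ["spinal", "injury"])
term_counts = analyze(selected)
print(sorted(selected))
print(term_counts.most_common(4))
```

The query here is a plain "all terms must appear" match; constructing a proper query for a real collection usually needs more care.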
Librarians' Role in Text/Data Mining
1. Advise on license language: help develop
publishers' licenses that address TDM.
See the work of the California Digital Library, JISC and CRL.
2. Assist researchers in TDM: inform them of
the TDM process and what data mining can do for
them, connect them with the tools to
accomplish TDM, and, through interviews,
develop strategies and “pilot studies”.
Use Case
Since 1982
- 90,000 journal articles on spinal cord injury
There has been an average of 22 journal
articles a day on spinal cord injury
How can all this information be analyzed?
TDM
With the help of automated software, a large amount of data and
text will be processed to identify entities, instances, actions,
relationships and patterns for further analysis
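One simple way to picture "identifying entities and relationships" is dictionary matching plus co-occurrence counting, sketched below. The entity list and the sample text are invented for illustration, and treating two entities in the same sentence as related is an assumption of this sketch, not a method described in the presentation.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical entity dictionary; real projects use much larger vocabularies.
ENTITIES = {"spinal cord injury", "rehabilitation", "stem cells", "paralysis"}

text = (
    "Spinal cord injury often causes paralysis. "
    "Rehabilitation helps patients with spinal cord injury. "
    "Stem cells are studied as a treatment for paralysis."
)

def find_entities(sentence):
    """Return the dictionary entities mentioned in one sentence."""
    lowered = sentence.lower()
    return {e for e in ENTITIES if e in lowered}

def cooccurrences(document):
    """Count entity pairs that appear together in the same sentence."""
    pairs = Counter()
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", document):
        found = sorted(find_entities(sentence))
        pairs.update(combinations(found, 2))
    return pairs

relations = cooccurrences(text)
for (a, b), n in relations.items():
    print(f"{a} <-> {b}: {n}")
```

Production tools typically add linguistic processing (for example, recognizing name variants) that plain substring matching cannot provide.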
Typical TDM Content
Text mining output typically consists of a new metadata layer for information:
- Journal article clusters, categorizations and indexes
- Topical maps, to show the occurrence of topics and their interrelationships
- Databases with facts, patterns, relationships, statements, assertions and properties
found in the articles
- Visualisations: graphs, mappings, plot-graphs and topical maps
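The "new metadata layer" can be pictured with a small sketch that builds an index over article titles and then groups articles by topic. The article ids, titles and topic keywords are hypothetical, and grouping by a single keyword is a deliberate simplification of real clustering and categorization.

```python
from collections import defaultdict

# Hypothetical article titles; the topic keywords are illustrative only.
articles = {
    "A1": "stem cells and spinal injury repair",
    "A2": "rehabilitation after spinal injury",
    "A3": "stem cells in heart repair",
}
TOPICS = ["spinal", "stem", "rehabilitation", "heart"]

def build_index(docs):
    """Inverted index: each word maps to the sorted ids of articles using it."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for word in title.split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def categorize(index, topics):
    """Metadata layer: topic -> articles, read straight from the index."""
    return {topic: index.get(topic, []) for topic in topics}

index = build_index(articles)
clusters = categorize(index, TOPICS)
for topic, ids in clusters.items():
    print(topic, ids)
```

The resulting topic-to-article mapping is the kind of data a topical map or visualisation could then be drawn from.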
Class: Please View
Smit, Eefke and Maurits van der Graaf. Content
Mining: a short introduction to practices and
policies. Presented for Center for Research
Libraries, July 17, 2013 (CRL Global Resources
Forum)
http://www.crl.edu/sites/default/files/follow_up_material/Smit.pdf
Class: Please Read
https://blogs.libraries.iub.edu/scholcomm/2013/01/07/a-guide-to-text-and-data-mining-at-indiana-universitybloomington/
http://www.libraries.iub.edu/index.php?pageId=530000216
Tools for searching the Deep Web
DeepDyve
http://www.deepdyve.com
Deep Web Technologies
http://www.deepwebtech.com
WorldWideScience.org
http://worldwidescience.org
Deep Web Harvester from BrightPlanet
http://www.brightplanet.com
Credits
Okerson, Ann. Text & Data Mining: A Librarian Overview. IFLA WLIC, Singapore, August 7,
2013
http://library.ifla.org/252/1/165-okerson-en.pdf
Smit, Eefke and Maurits van der Graaf. Content Mining: a short introduction to practices and
policies. Presented for Center for Research Libraries, July 17, 2013 (CRL Global
Resources Forum)
http://www.crl.edu/sites/default/files/follow_up_material/Smit.pdf
Speirs, Martha A. Data mining for scholarly journals: challenges and solutions for libraries.
IFLA WLIC 2013, June 28, 2013
EMEA regional council meeting connects members to the latest in library data research:
Mining insights from 50 million books. NEXTSpace no. 21, May 2013
YouTube: Text/Data mining, libraries and online publishers, July 17, 2013. CRL.
http://www.youtube.com/watch?v=2e1xymY9ePg
Chiang, Katherine. Data mining, data fusion, and libraries. June 21, 2010. 31st Annual IATUL
Conference. Paper 4
http://docs.lib.purdue.edu/iatul2010/conf/day1/4