Slides - Book and Byte

Digital Text and
Data Processing
Week 1
Course background
□ Future of reading
□ Understanding “Machine
reading”:
□ Text analysis tools
□ Visualisation tools
□ Differences between
machine reading and human
reading
Images taken from textarc.org and from
Google App store, Javelin for Android
Scale
Text Mining
□ “a collection of methods used to find
patterns and create intelligence from
unstructured text data” (1)
□ Related to data mining
□ Information is found “not among
formalised database records, but in the
unstructured textual data” (2)
(1) Francis, Louise. “Taming Text: An Introduction to Text Mining.” Casualty Actuarial Society
Forum Winter (2006), p. 51
(2) Feldman, Ronen, and James Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing
Unstructured Data. Cambridge: Cambridge University Press, 2007, p. 1
Difficulties of natural language
One thing was certain, that the WHITE kitten had had nothing
to do with it:--it was the black kitten's fault entirely.
For the white kitten had been having its face washed by the
old cat for the last quarter of an hour (and bearing it
pretty well, considering); so you see that it COULDN'T have
had any hand in the mischief.
Down, down, down. There was nothing else to do, so Alice
soon began talking again. 'Dinah'll miss me very much tonight, I should think!' (Dinah was the cat.) … And here
Alice began to get rather sleepy, and went on saying to
herself, in a dreamy sort of way, 'Do cats eat bats? Do cats
eat bats?'
In a Wonderland they lie, Dreaming as the days go by,
Dreaming as the summers die: Ever drifting down the stream,
Lingering in the golden gleam. Life, what is it but a dream?
□ Semantic categories are generally implicit
□ Inflections: conjugations and declension
□ Homonyms and synonyms
□ Meaning is context-specific
□ Spelling changes over time or may vary
across regions
I trod on grass made green by summer's
rain,
Through the fast-falling rain and high-wrought sea
'Tis like a wondrous strain that sweeps
And suddenly my brain became as sand
She mixed; some impulse made my heart
refrain
were found where the rainbow quenches its
points upon the earth
Rain rain rains rain’s Rain’s Rain. rain. Rain! ‘rain’
Two stages in text mining
□ Data creation
□ Data analysis
Weekly Programme
Cluster 1: Data creation
□ W1: Introduction to the course and
introduction to the Perl programming
language
□ W2: Regular expressions, word
segmentation, frequency lists, types and
tokens
□ W3: Natural language processing: Part of
Speech tagging, lemmatisation
□ W4: Exploration of existing text mining
tools
Weekly Programme
Cluster 2: Data analysis
□ W5: Introduction to R package
□ W6: Multivariate analysis: Principal
Component Analysis, Clustering
techniques
□ W7: Visualisation
□ W8: Conclusion: What type of knowledge
can we create?
Individual Research project
□ Techniques taught in DTDP generally
enable you to study formal differences
and similarities between texts, e.g.
vocabulary, sentence length,
grammatical structure
□ Create a corpus of at least four different
texts, of ca. 5000 words each; you can
copy texts from existing corpora
□ You can apply the techniques which are
explained in this class to your own
corpus
□ Formulate your own research question
Course evaluation
□ 3 assignments (1 point to be earned for
each)
□ Final essay (ca. 3,000 words)
□ Report of your individual research
project (3 points)
□ Critical reflection on digital humanities
research (4 points)
□ What sort of knowledge can be
produced? How does this type of
research relate to traditional
scholarship?
□ Is programming a legitimate
scholarly activity in the humanities?
□ Can visualisations of texts function
as independent scholarly resources?
Introduction to programming
□ Programming languages: used to give
instructions to a computer
□ There is a gap between human language and
machine language
□ Digital information is information represented
as combinations of 1s and 0s,
e.g.: A = 01000001
□ Low-level programming languages:
Assembler, e.g. ADD X1 Y1
□ Higher-level programming languages:
translated by compilers or interpreters
Human programmer → Programming language, e.g. Perl → Language processor → Machine language (0101100101010) → Computer
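To make the idea of 1s and 0s concrete, the following is a minimal sketch that prints the 8-bit pattern for a character. It assumes plain ASCII characters; char_to_bits is a hypothetical helper name chosen for this illustration, while ord and sprintf are built-in Perl functions.

```perl
use strict;
use warnings;

# Return the 8-bit binary pattern of a single character's code.
# (char_to_bits is a hypothetical helper name, not part of Perl itself.)
sub char_to_bits {
    my ($char) = @_;
    return sprintf( "%08b", ord($char) );
}

print "A = ", char_to_bits("A"), "\n";    # uppercase A
print "a = ", char_to_bits("a"), "\n";    # lowercase a
```

Uppercase and lowercase letters have different bit patterns, which is one reason why a computer treats "Rain" and "rain" as different strings.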
Algorithm
□ Etymology:
Muhammad ibn Musa al-Khwarizmi,
Al-kitāb al-mukhtaṣar fī ḥisāb al-ğabr
wa’l-muqābala
□ Unambiguous descriptions of the steps
which need to be followed to arrive at a
well-defined result
□ Developed by human beings!
Getting started
1. Create a working directory on your computer
2. Open a code editor and type the following
lines:
print "It works!" ;
3. Use the .bat file that is provided
Variables
□ Scalar variables are always preceded by a dollar sign
$keyword
□ Variables can be assigned a value with a
specific data type (‘string’ or ‘number’)
$keyword = "time" ;
$number = 10 ;
□ Three types of variables: scalar, array, hash
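The three types of variables might look as follows. This is only a sketch: the variable names and the counts in the hash are made-up illustration values.

```perl
use strict;
use warnings;

# Scalar: a single value, either a string or a number.
my $keyword = "time";
my $number  = 10;

# Array: an ordered list of values, written with @.
my @keywords = ( "fire", "rain", "moon" );

# Hash: pairs of keys and values, written with %.
# The counts below are made-up illustration values.
my %count = ( "rain" => 3, "moon" => 1 );

print "Keyword: $keyword\n";
print "Number: $number\n";
print "Second keyword: $keywords[1]\n";
print "Count for rain: $count{rain}\n";
```

Note that a single element of an array or a hash is itself a scalar, so it is written with a dollar sign.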
Strings
□ Can be created with single quotes and with
double quotes
□ In the case of double quotes, the contents of
the string will be interpreted.
□ You can then use “escape characters” in your
string to add basic formatting:
"\n" new line
"\t" tab
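The difference between the two kinds of quotes can be seen in a short sketch. The variable $poet and its value are assumptions made for this illustration only.

```perl
use strict;
use warnings;

my $poet = "Shelley";

# Single quotes: the contents are taken literally.
my $literal = 'Poet: $poet\n';

# Double quotes: variables and escape characters are interpreted.
my $interpreted = "Poet:\t$poet\n";

print $literal, "\n";
print $interpreted;
```

The first print shows the characters $poet and \n exactly as typed; the second shows a tab, the value of the variable, and a line break.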
Statements
□ Perl statements can be compared to
sentences.
□ Perl statements end in a semi-colon!
print "This is a statement!" ;
Exercise
Print a string that looks as follows:
This is the first line.
This is the second line.
This line contains a
tab.
Operators
Assignment
=    e.g. $a = 5 ;
Arithmetic operators
+    Addition
-    Subtraction
*    Multiplication
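A brief sketch of assignment combined with the arithmetic operators. The variable names and numbers are made-up illustration values.

```perl
use strict;
use warnings;

# Made-up illustration values.
my $stanzas          = 4;
my $lines_per_stanza = 6;

my $total = $stanzas * $lines_per_stanza;    # multiplication
my $more  = $total + 2;                      # addition
my $fewer = $total - 2;                      # subtraction

print "Total: $total\n";
print "Two more: $more\n";
print "Two fewer: $fewer\n";
```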
Exercise
Create two variables, and assign
a numerical value to both of
them
Print their sum, their difference
and their product.
Reading a file
Is done as follows:
open ( IN , "shelley.txt" ) or die "Cannot open shelley.txt: $!" ;
while ( <IN> ) {
print $_ ;
}
close ( IN ) ;
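The same steps can also be written with a variable as the file handle and an explicit read mode, a style often seen in more recent Perl code. This is a sketch rather than the required form for this course: read_lines is a hypothetical helper name, and the check with -e is added only so that the script still runs when "shelley.txt" is not in the working directory.

```perl
use strict;
use warnings;

# Read all lines of a file into a list and return them.
# (read_lines is a hypothetical helper name.)
sub read_lines {
    my ($path) = @_;
    open( my $in, "<", $path ) or die "Cannot open $path: $!";
    my @lines = <$in>;
    close($in);
    return @lines;
}

# Only read the file if it is present in the working directory.
my $file = "shelley.txt";
if ( -e $file ) {
    print read_lines($file);
} else {
    print "File $file not found in the working directory.\n";
}
```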
Exercise
Create a Perl application which can read the
text file “shelley.txt” and which can print all
the lines.
Control keywords
if ( <condition> ) {
<first block of code>
} elsif ( <condition> ) {
<second block of code>
} else {
<last block of code; default option>
}
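Filled in with real conditions, the template above might look like this sketch, which sorts lines into three groups by their length. The helper name classify_line and the thresholds 20 and 40 are arbitrary choices made for this illustration.

```perl
use strict;
use warnings;

# Classify a line by its length in characters.
# The thresholds 20 and 40 are arbitrary illustration values.
sub classify_line {
    my ($line) = @_;
    if ( length($line) < 20 ) {
        return "short";
    } elsif ( length($line) < 40 ) {
        return "medium";
    } else {
        return "long";
    }
}

print classify_line("rain,"), "\n";
print classify_line("And suddenly my brain became as sand"), "\n";
print classify_line("Through the fast-falling rain and high-wrought sea"), "\n";
```

Only one of the three blocks is executed for each line: the conditions are tested from top to bottom, and the else block runs when none of them is true.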
Regular expressions
□ The pattern is given within two forward
slashes
□ Use the =~ operator to test if a given string
contains the regex.
□ Example:
$keyword =~ /rain/
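A small sketch of the =~ operator at work. The three sample lines are quoted from earlier slides and stand in for the contents of a text file; the variable names are assumptions made for this illustration.

```perl
use strict;
use warnings;

# A few sample lines stand in for the contents of a text file.
my @lines = (
    "I trod on grass made green by summer's rain,",
    "Lingering in the golden gleam.",
    "And suddenly my brain became as sand",
);

# Collect the lines in which the pattern occurs.
my @matches;
foreach my $line (@lines) {
    if ( $line =~ /rain/ ) {
        push( @matches, $line );
    }
}

foreach my $match (@matches) {
    print "$match\n";
}
```

The line with "brain" is printed as well, because the pattern matches the character sequence anywhere in the string, not only as a separate word.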
Exercise
Create an application in Perl which can read a
machine readable version of Shelley’s Collected
Poems (file is provided) and which can print all
lines that contain a given keyword.
(suggestions: “fire” , “rain” , “moon”, “storm”,
“time”)
Regular expressions (2)
□ If you place “i” directly after the second
forward slash, the comparison will take place
in a case insensitive manner.
□ \b can be used in regular expressions to
represent word boundaries
if ( $keyword =~ /\btime\b/i ) {
}
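The effect of \b and of the "i" modifier can be seen by counting matches with and without word boundaries. This is a sketch on four sample lines adapted from earlier slides; the variable names are assumptions made for this illustration.

```perl
use strict;
use warnings;

my @lines = (
    "Rain! rain. 'rain'",
    "And suddenly my brain became as sand",
    "She mixed; some impulse made my heart refrain",
    "were found where the rainbow quenches its points",
);

# Lines containing "rain" as a whole word, ignoring case.
my $whole_word = 0;
# Lines containing the letters "rain" anywhere, ignoring case.
my $anywhere = 0;

foreach my $line (@lines) {
    if ( $line =~ /\brain\b/i ) {
        $whole_word++;
    }
    if ( $line =~ /rain/i ) {
        $anywhere++;
    }
}

print "Whole word: $whole_word\n";
print "Anywhere: $anywhere\n";
```

Without word boundaries all four lines match; with \b only the first line remains, since "brain", "refrain" and "rainbow" merely contain the letters.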
Additional exercises
□ Create a program that can count the total
number of lines in the file “shelley.txt”
□ Create a program that can calculate the
length of each line, using the length()
function
length( $line ) ;
□ Calculate the average line length (in
characters) for the entire file.