roe_dataMining - Digital Humanities at Oxford

The Dangers and Delights of
Data Mining
Glenn Roe
Digital.Humanities@Oxford Summer School
July 3 2012
Some opening thoughts....
• Machine Learning (ML) and Data Mining (DM) techniques will
drive future humanistic research as a central component of
future digital libraries.
• Old Digital Humanities (DH) tools were transparent. ML/DM
are opaque.
• General impact of ML on all humanities research: categorize,
link, organize, direct attention to some texts rather than
others automatically.
• Examine three areas of possible critical assessment.
• DH is uniquely well-suited to critique the application of
machine learning techniques in the humanities.
Emerging Digital Libraries
Scale of digital collections requires machine assistance to:
• categorize and organize
• propose intertextual relations
• evaluate and rank queries
• facilitate discovery and navigation
There are only about 30,000 days in a human life -- at a book a day, it would take 30
lifetimes to read a million books and our research libraries contain more than ten
times that number. Only machines can read through the 400,000 books already
publicly available for free download from the Open Content Alliance.
-- Gregory Crane
Only machines will read all the books.
And 5 million books?
We constructed a corpus of digitized texts containing about 4% of all books ever printed.
Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey
the vast terrain of “culturomics” focusing on linguistic and cultural phenomena that
were reflected in the English language between 1800 and 2000. We show how this
approach can provide insights about fields as diverse as lexicography, the evolution of
grammar, collective memory, the adoption of technology, the pursuit of fame,
censorship, and historical epidemiology. “Culturomics” extends the boundaries of
rigorous quantitative inquiry to a wide array of new phenomena spanning the social
sciences and the humanities.
www.sciencexpress.org / 16 December 2010
Culturomics…
Reading from afar… (or not at all).
Distant reading: where distance, let me repeat
it, is a condition of knowledge: it allows you to
focus on units that are much smaller or much
larger than the text: devices, themes, tropes—
or genres and systems. And if, between the
very small and the very large, the text itself
disappears, well, it is one of those cases when
one can justifiably say, less is more. If we want
to understand the system in its entirety, we
must accept losing something. We always pay a
price for theoretical knowledge: reality is
infinitely rich; concepts are abstract, are poor.
But it’s precisely this ‘poverty’ that makes it
possible to handle them, and therefore to know.
This is why less is actually more.
Franco Moretti, “Conjectures on World
Literature” (2000)
http://www.newleftreview.org/A2094
“Not Reading” has a long history.
L’Histoire du livre
• Dépôt légal
• After-death inventories
• Library holdings/circulation records
• Archives of publishers
• Vocabulary of titles (Furet)
• Censorship records
•…
Martin, Furet, Darnton, Chartier, etc…
From Not Reading to Text Mining
By “not reading” we examine:
concordances,
frequency tables,
feature lists,
classification accuracies,
collocation tables,
statistical models, etc…
We track:
Literary topoi (E.R. Curtius), concepts (R. Koselleck, Begriffsgeschichte),
and other semantic patterns: over time, between categories, across
genres.
So that distant reading and text mining can provide larger contexts for close
reading.
Text Mining as Pattern Detection
Data mining is the extraction of implicit, previously unknown, and potentially
useful information from data. The idea is to build computer programs that
sift through databases automatically, seeking regularities or patterns.
Strong patterns, if found, will likely generalize to make accurate
predictions on future data. Of course, there will be problems. Many
patterns will be banal and uninteresting. Others will be spurious,
contingent on accidental coincidences in the particular dataset used. And
real data is imperfect: some parts are garbled, some missing. Anything
that is discovered will be inexact: there will be exceptions to every rule
and cases not covered by any rule. Algorithms need to be robust enough
to cope with imperfect data and to extract regularities that are inexact but
useful.
-- Ian Witten, Data Mining: Practical Machine Learning Tools and
Techniques, xvix.
Transparency of traditional DH approaches
PhiloLogic: A few choice words...
Open Source: http://philologic.uchicago.edu/
Advantages:
• Fast, robust, many search and reporting features.
• Collocation tables, sortable KWICS, etc.
• Handles various encoding schemes and object types.
• Known to work with most languages.
Limitations:
• User initiated search for small number of words.
• Limited order of generalization.
• How to address larger issues (gender or genre).
• What to do with 150,000 (or more) hits?
Transparency of traditional DH approaches
PhiloLogic searches
return what you asked
for in the order in which
you asked.
Example: search for
various forms of
moderni.* 1850-99
You get 82 hits.
Results can be sorted
and organized.
Requires user selection.
The user sifts through
results and analyzes
effectively raw output
data.
Machine Learning is opaque...
ML systems depend on many assumptions and selections that
are not readily available to end users.
The hunt for Google’s infamous “secret sauce.”
Open competition to find the over 250 ingredients in the Google
search/sauce algorithms.
A “Black-box” industry: analyzing the secret sauce for profit.
Many commercial organizations examine Web mining
extensively:
e.g., “Search Engine Watch” www.searchenginewatch.com
Two ways of using DM in the humanities
1) Tool approach: PhiloMine, MONK, etc. allow direct
manipulation of data mining materials.
2) Embedded approach: results of machine learning or text
mining become part of general systems.
- Google and other WWW search engines
- Dedicated library systems (AquaBrowser)
Most humanities scholars will use embedded machine learning
systems.
Embedded Machine Learning Systems
Humanists are already using machine learning and data
mining in general applications:
spam filters
movie recommendations (Netflix)
related book/article suggestions (Amazon)
Adwords (monetizing the noun)
etc...
And coming soon to a library near you: LENS....
Building Data Mining Tools:
Three types of data/text mining
*Distinction is arbitrary and does not cover all text mining tasks.
1. Predictive Classification: learn categories from labeled data, predict
on unknown instances.
2. Comparative Classification: learn categories from labeled data to find
accuracy rate, errors, and most important features.
3. Similarity: measure document/part similarities, looking for
meaningful connections.
Predictive Classification
Widely used: spam filters, recommendation systems, etc.
Computer “reads” text, identifies the words (features) most associated
with each class (author, class of knowledge).
Humanities applications: extract classes or labels from contemporary
documents.
Use contemporary classification system rather than modern system to
predict classes.
*Problem: information space can be noisy, incoherent.
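The predictive workflow described above (learn word/class associations from labeled texts, then predict the class of an unseen text) can be sketched with a small multinomial Naive Bayes classifier. This is a minimal illustration, not the PhiloMine implementation: the function names `train_nb` and `predict_nb`, the add-one smoothing, and the four toy training "articles" are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Count documents per class and words per class from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"classes": class_counts, "words": word_counts, "vocab": vocab}

def predict_nb(model, tokens):
    """Return the class with the highest log-probability for the given tokens."""
    total_docs = sum(model["classes"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["classes"].items():
        counts = model["words"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for tok in tokens:
            if tok in model["vocab"]:  # words never seen in training are skipped
                # add-one (Laplace) smoothing, an assumption of this sketch
                score += math.log((counts[tok] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy labeled data (hypothetical, not taken from the Encyclopédie database)
training = [
    ("chartre roi coutume ordonnance".split(), "Jurisprudence"),
    ("arret coutume patronage ordonnance".split(), "Jurisprudence"),
    ("oiseau plume bec aile".split(), "Natural history"),
    ("oiseau canard aile eau".split(), "Natural history"),
]
model = train_nb(training)
predicted = predict_nb(model, "chartre coutume eglise".split())
print(predicted)
```

The prediction depends entirely on which words were seen with which labels, which is why a noisy or incoherent label space carries straight through to the output.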
Predictive Classification
Text Mining the Digital Encyclopédie
74,131 articles in the current database
13,272 articles without classification (18%)
We trained our classifiers on the 60K classified articles (comprising 2,899
individual classes) to generate a model, which is then used to classify the
unknown instances and then to reclassify all 74K articles.
The resulting “ontology” was optimized to 360 classes – this is a typical result of
machine classification.
Predictive Classification
Classifying the unclassified:
• DISCOURS PRELIMINAIRE DES EDITEURS, Class=Philosophy
• DEMI-PARABOLE, Class=Algebra
• Bois de chauffage, Class=Commerce
• Canard, Class=Natural history; Ornithology
• Chartre de Champagne, Class=Jurisprudence
• Chartre de commune, Class=Jurisprudence
• Chartre aux Normands, Class=Jurisprudence
• Chartre au roi Philippe, Class=Ecclesiastical history
Chartre au roi Philippe fut donnée par Philippe Auguste vers la fin de l'an 1208, ou au commencement de l'an 1209,
pour régler les formalités nouvelles que l'on devoit observer en Normandie dans les contestations qui
survenoient pour raison des patronnages d'église, entre des patrons laiques & des patrons ecclésiastiques.
Cette chartre se trouve employée dans l'ancien coûtumier de Normandie, après le titre de patronnage d'église;
& lorsqu'on relut en 1585 le cahier de la nouvelle coûtume, il fut ordonné qu' à la fin de ce cahier l'on inséreroit
la chartre au roi Philippe & la chartre Normande. Quelques - uns ont attribué la premiere de ces deux chartres à
Philippe III. dit le Hardi; mais elle est de Philippe Auguste, ainsi que l'a prouvé M. de Lauriere au I. volume des
ordonnances de la troisieme race, page 26. Voyez aussi à ce sujet le recueil d' arrêts de M. Froland, partie I.
chap. vij.
Comparative Classification
“Comparative Categorical Feature Analysis”
Use classifiers as a form of hypothesis testing.
Train a classifier on a set of categories (gender of author, class of
knowledge).
Run the trained model on the same data to find:
• Accuracy of classification
• Most salient features
• Errors or Mis-classified instances
*Classification errors can be rich sources of inquiry for humanists.
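The comparative procedure above (run the trained model back over its own labeled data, then inspect accuracy, salient features, and errors) might look roughly like the following. Everything here is a hypothetical stand-in: `toy_classify` is a one-rule placeholder for a real trained classifier, the four documents are invented, and the "in-class minus out-of-class count" salience score is just one simple choice among many.

```python
from collections import Counter

def evaluate_on_training_data(classify, labeled_docs):
    """Re-run a classifier over its labeled data; return accuracy and the errors."""
    errors = []
    for doc_id, tokens, label in labeled_docs:
        guess = classify(tokens)
        if guess != label:
            errors.append((doc_id, label, guess))
    accuracy = 1 - len(errors) / len(labeled_docs)
    return accuracy, errors

def salient_features(labeled_docs, top_n=2):
    """Rank words per class by occurrences inside the class minus occurrences elsewhere."""
    per_class = {}
    overall = Counter()
    for _, tokens, label in labeled_docs:
        per_class.setdefault(label, Counter()).update(tokens)
        overall.update(tokens)
    result = {}
    for label, counts in per_class.items():
        scored = {w: c - (overall[w] - c) for w, c in counts.items()}
        result[label] = sorted(scored, key=lambda w: (-scored[w], w))[:top_n]
    return result

def toy_classify(tokens):
    # Hypothetical stand-in for a trained model: a single keyword rule.
    return "Law" if "loi" in tokens else "Nature"

# Invented toy data: (document id, tokens, original label)
docs = [
    ("a1", ["loi", "roi", "loi"], "Law"),
    ("a2", ["loi", "coutume"], "Law"),
    ("a3", ["oiseau", "plume"], "Nature"),
    ("a4", ["loi", "nature", "oiseau"], "Nature"),
]
accuracy, errors = evaluate_on_training_data(toy_classify, docs)
features = salient_features(docs)
print(accuracy, errors, features)
```

The error list is the part a humanist would read closely: document `a4` is "wrong" only because its vocabulary overlaps with another class, which is the kind of case the slide flags as a source of inquiry.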
Comparative Classification
Text Mining the Digital Encyclopédie
Original # of classes: 2,899 - New # of classes: 360
73.3% of articles were assigned to their original class, a number that is
amazing given the complexity of the ontology.
Which means that 26.7% of articles have a different class?
This also means that of the 74,131 articles:
44,628 classified correctly
16,231 classified “incorrectly”
13,272 unclassified were classified
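As a quick check, the figures on this slide can be recomputed from one another. The short sketch below uses only the numbers given above; the variable names are illustrative.

```python
total_articles = 74_131
unclassified = 13_272
correct = 44_628
incorrect = 16_231

# Articles that already carried a class label before reclassification
previously_classified = total_articles - unclassified

# The two outcome counts should account for every previously classified article
assert correct + incorrect == previously_classified

accuracy = correct / previously_classified
print(f"{accuracy:.1%} kept their original class, {1 - accuracy:.1%} changed")
print(f"{unclassified / total_articles:.0%} had no original class")
```

Note that the 73.3% rate is computed over the previously classified articles only, not over all 74,131.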
Comparative Classification
Text Mining the Digital Encyclopédie
Accrues: original classification too specific
Tepidarium: reclassification seems more logical
Achées: incorrect prediction although appropriate given vocabulary
Comparative Classification
Predict classifications in other texts:
Classification of Diderot's Éléments de physiologie by chapter.
Most chapters classed as anatomy, medicine, physiology.
"Avertissement": literature
Chapter "Des Etres": metaphysics
Chapter "Entendement": metaphysics and grammar
Chapter "Volonté": ethics
Leverage a contemporary classification system as way to support
search and result filtering.
Clusters of Knowledge
Top: History, Geography, Literature, Grammar,
etc.
Middle : Physical Sciences, Physics,
Chemistry, etc.
Lower: Biological Sciences & Natural History
Similarity: Documents
Comparative and Predictive Classification are one way to find meaningful
patterns by abstracting data from the text.
Typically build abstract models of a knowledge space based on
identified characteristics of documents. (Supervised learning)
Document similarity: unsupervised learning based on statistical
characteristics of contents of texts.
Many applications: Clustering, Topic Modeling, kNN classifiers etc.
Vector Space Similarity (VSM)
• Documents are “bags of words” (no word order).
• Each bag can be viewed as a vector.
• Vector dimensionality corresponds to the number of words in our
vocabulary.
• Value at each dimension is number of occurrences of the associated
word in the given document:
amour  ancien  livre  propre
  1      0       3      0
All document vectors taken together comprise a document-term matrix.
*Used for many applications: information retrieval to topic segmentation.
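A minimal sketch of the bag-of-words representation described above: each document becomes a row of word counts over a shared vocabulary. The function name and the two toy documents are assumptions; the first document is constructed so that its row matches the example vector on the slide.

```python
from collections import Counter

def build_document_term_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    vocabulary = sorted({word for doc in documents for word in doc})
    matrix = []
    for doc in documents:
        counts = Counter(doc)  # word order is discarded here
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Toy documents, already tokenized (hypothetical)
docs = [
    "livre amour livre livre".split(),
    "ancien livre propre".split(),
]
vocabulary, matrix = build_document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

With a real corpus the vocabulary runs to many thousands of words and most entries are zero, so sparse structures are normally used in place of nested lists.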
Identification of similar articles
d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})
q = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})
Similarity: cosine of the angle between the two vectors in t-dimensional space, where
the dimensionality t is equal to the number of words in the vocabulary.
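The cosine measure can be sketched in a few lines. This is the standard textbook formulation (dot product divided by the product of the vector lengths); the function name and the second vector are illustrative assumptions, while the first vector is the count example shown earlier (amour=1, ancien=0, livre=3, propre=0).

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_d * norm_q)

d = [1, 0, 3, 0]   # example vector from the slide
q = [0, 1, 1, 1]   # hypothetical second document
score = cosine_similarity(d, q)
print(round(score, 3))
```

Because the score is normalized by vector length, a short article and a long one can still score as similar if their word proportions agree.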
Identification of similar articles
Vector Space can be used to identify similar articles.
Size matters - some unexpected results.
GLOIRE, GLORIEUX, GLORIEUSEMENT, Voltaire,
VANITÉ, NA, [Ethics] [0.539]
VOLUPTÉ, NA, [Ethics] [0.514]
FLATEUR, Jaucourt, [Ethics] [0.513]
GOUVERNANTE d’enfans, Lefebvre, [0.511]
CHRISTIANISME, NA, [Theology| Political science] [0.502]
PAU, Jaucourt, [Modern geography] [0.493]
PAU: birthplace of Henri IV.
VSM: Strengths/Limitations
Strengths:
• Well understood.
• Standard and robust.
• Many applications: kNN classifiers, clustering, topic segmentation.
• Assigns a numeric score which can be used with other measures (e.g., edit distance of headword).
• Numerous extensions and modifications: Latent Semantic Analysis, etc.
Limitations:
• Bag of words: no notion of text order.
• Requires identification of documents or blocks: articles.
• Not suitable for running text.
• Cannot identify smaller borrowings in longer texts.
• Similarity can reflect topic, subject, or theme, unrelated to “borrowing” or reuse.
Topic Modeling and LDA
• Topic modeling is a probabilistic method to classify text using
distributions over words.
• In statistics, latent Dirichlet allocation (LDA) is a generative model
that allows sets of observations to be explained by unobserved
groups that explain why some parts of the data are similar.
• This method of analyzing text was first
demonstrated by David Blei, Andrew Ng
and Michael Jordan in 2002.
Johann Peter Gustav Lejeune Dirichlet
What does LDA do?
• LDA is an unsupervised word clusterer and classifier.
• Preliminary assumption : each text is a combination of several
topics.
• Each document is given a classification with a ranking of the
most important topics.
• LDA generates distributions over words, or topics, from the text
and classifies the corpus accordingly.
dieu ame monde etre nature matiere esprit chose homme substance principe corps
univers philosophe systeme idee intelligence eternite rien divine existence creature
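The assumption that "each text is a combination of several topics" is easiest to see by running LDA's generative story forwards: draw a topic mixture for the document, then draw each word from one of the topics. The sketch below does only that. It is not an inference algorithm (real LDA tools work backwards from texts to topics), and the two topics, their word probabilities, and the `alpha` values are invented for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate `length` words: pick a topic per word, then a word from that topic."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic proportions
    names = list(topics)
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=theta)[0]
        vocab = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words

# Hypothetical topics as distributions over words
topics = {
    "metaphysics": {"dieu": 0.4, "ame": 0.3, "substance": 0.3},
    "astronomy": {"planete": 0.5, "soleil": 0.3, "cadran": 0.2},
}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=12, rng=rng)
print([round(p, 2) for p in theta], words)
```

Fitting LDA amounts to finding the topics and per-document mixtures under which the observed texts would be most plausible as outputs of this process.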
Prior research on LDA
• David Blei ran a series of experiments on the journal Science
from 1880 to 2002.
Topic : energy molecule atoms matter atomic molecular theory
(1900-1910)
"The Atomic Theory from the Chemical Standpoint"
"On Kathode Rays and Some Related Phenomena"
"The Electrical Theory of Gravitation"
"On Kathode Rays and Some Related Phenomena"
"A Determination of the Nature and Velocity of Gravitation"
"Experiments of J. J. Thomson on the Structure of the Atom"
Future research with LDA
• Text Segmentation : identify topic shifts within a document by
classifying paragraphs.
• Dynamic topic modeling : understand how discourse evolves over
time.
Example from David Blei on epidemiology :
1880 : disease, cholera, found, fever, organisms
1910 : disease, fund, fungus, spores, cultures
1940 : cultures, virus, culture, strain, strains
1970 : mice, strain, strains, host, bacteria
2000 : bacteria, strain, strains, resistance, bacterial
Strengths/Weaknesses of LDA
• LDA is a powerful tool to classify unclassified data sets.
• A lot of research is being done on Topic Modeling by computer
scientists: it is our challenge to use their findings and apply them to text
analysis.
• LDA is just one aspect of the wider goal of having machines
contextualize text, identify coherent segments, and ultimately ease
the processing of very large corpora.
A “Critical” Approach to Data Mining
Critique is a fundamental humanistic activity which is not necessarily
limited to texts (i.e., “reading the body”).
Machine learning will be a necessary component of future humanities
research, and Digital Humanities is uniquely situated and suited to
critique ML tools and their applicability moving forward.
I will touch on three primary areas of critique drawn from our own
experiments with machine learning:
1) algorithms, features, and parameters;
2) classification and ontologies;
3) intertextual relations.
Opening the Black Box: PhiloMine
Open Source: http://code.google.com/p/philomine/
• PhiloLogic extension uses existing services.
• Permits moving to particular texts or features.
• WWW based form submission with defined tasks.
• Many classifiers (Support Vector Machine, etc).
• Many features (words, n-grams, lemmas, etc).
• Many feature selection and normalization options.
Opening the Black Box: PhiloMine
Algorithms, Features, & Parameters
Algorithms = classifiers, segmenters, similarities, aligners
Features = salient to task, elements of texts which can be
computed (words, lemmas, n-grams, etc.)
Parameters = many which have significant impact on results
The devil is in the combination of details at all levels...
Features and Parameters Matter
Parameter selection includes:
• type of features such as words, n-grams, and lemmas
• range of features, such as limiting to features that appear in a
minimum number of instances
• statistical normalization of features
• thresholds for various functions
Algorithm and parameter selection are task and data dependent
Selection of algorithms and adjustment of parameters can
radically alter results. For example...
Mining the Encyclopédie:
Vector Space Similarity
Similarity - Unexpected Links
Gnomonique similar to Wolstrope. Why?
Gnomonique describes various types of
“cadrans” or sun dials that depend on
the movement of celestial bodies.
Wolstrope (modern geography) is the
birthplace of Isaac Newton.
Other most similar articles include
Saturn, Planet, Clock making and
Tylehurst, the birthplace of William Lloyd
with a long exposition of his work and
the history of the calendar by Newton.
*Gnomonics: the art or science of constructing dials, as
sundials, which show the time of day by the shadow of the
gnomon (γνώμων), a pin or triangle raised above the surface
of the dial.
Mining the Encyclopédie:
Vector Space Similarity
Same Vector Space Similarity
problem as before with TF-IDF
values rather than raw counts
for features.
TF-IDF normalizes word
frequencies across articles. The
weight increases proportionally
to the number of times a word
occurs in a document but is
offset by the frequency of the
word in the entire corpus.
Produces rather different
results.
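One common way to compute the weighting described above is raw term frequency multiplied by the log of the inverse document frequency. The slides do not give PhiloMine's exact formula, so the variant below, along with the function name and the toy documents, is an assumption.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        weighted.append(
            {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return weighted

# Toy tokenized documents (hypothetical)
docs = [
    ["cadran", "ombre", "cadran", "le"],
    ["planete", "le"],
    ["horloge", "le", "cadran"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document "cadran" occurs twice but also appears in another document, so the rarer "ombre" ends up with the higher weight, and "le", which occurs everywhere, drops to zero. That reordering is why similarity lists change when counts are replaced with TF-IDF values.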
Why Parameters Matter
Similarity using counts
Similarity using TF-IDF
Note differences in articles identified as most similar and different rankings of the
same articles (Wolstrope) when using different parameters.
On inspection, both lists are reasonable and interesting. Ombre (shadow) is related to
sundials.
Experimentation, selection, and evaluation required.
Why Features Matter
1. Similarity (TF-IDF) 2,830 features
2. Similarity (TF-IDF) 7,500 features
Feature reduction is a critical function in machine learning tasks
Figure 1: features in more than 3% of articles (2,830)
Figure 2: features in more than 1% of articles (7,500)
Note the differences in most similar articles and rankings.
Feature selection: critical impact on all types of tasks.
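The thresholds in the two figures (features in more than 3% or more than 1% of articles) are a document-frequency filter. A minimal sketch follows; the function name and the four toy documents are assumptions, and the cutoffs are scaled up so that the effect is visible on so small a sample.

```python
from collections import Counter

def select_features(documents, min_fraction):
    """Keep words that occur in more than `min_fraction` of the documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    return sorted(w for w, df in doc_freq.items() if df / n_docs > min_fraction)

# Toy tokenized documents (hypothetical)
docs = [
    ["cadran", "ombre", "soleil"],
    ["cadran", "soleil"],
    ["cadran", "planete"],
    ["ombre", "horloge"],
]
strict = select_features(docs, 0.5)    # analogous to the higher cutoff
loose = select_features(docs, 0.25)    # analogous to the lower cutoff
print(strict, loose)
```

Lowering the cutoff admits rarer words, which changes the vectors being compared and therefore which articles come out as most similar.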
Why Features Matter
Bi-grams: sequences of two words (in this case, lemmas, or root forms) with
function words removed.
From the article Gnomonique (with frequencies):
académie_royal 1
afin_empêcher 1
aller_voir 1
an_avant 1
an_fondation 2
an_jusque 1
ancien_géometres 1
ancien_historien 1
angle_devoir 1
appelloient_autrefois 1
apprendre_facilement 1
art_écrire 1
attribuer_invention 1
autant_petit 1
avant_alexandre 1
avant_appliquer 1
avant_époque 1
avril_septembre 1
beaucoup_aisé 1
beaucoup_haut 1
beaucoup_plûtôt 1
bout_duquel 1
cadran_cadran 1
cadran_horisontal 2
cadran_solaire 3
cadran_vertical 1
caracteres_suivans 1
cause_position 1
certain_déterminer 1
certain_jour 1
chacun_moi 1
chap_xxxviij 1
chaque_moi 1
chez_juif 1
chez_nation 1
circonférence_cercle 1
PhiloMine generates lemmas and n-grams.
Currently using TreeTagger for
English and French lemmatizing and
part of speech identification.
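Bi-gram features of the kind listed above can be sketched as follows. The input is assumed to be already lemmatized (the TreeTagger step is not reproduced here), and the function-word list is a tiny hypothetical one rather than the list PhiloMine uses.

```python
from collections import Counter

# Hypothetical, deliberately tiny function-word list
FUNCTION_WORDS = {"le", "la", "les", "de", "un", "une", "et"}

def content_bigrams(lemmas, stopwords=FUNCTION_WORDS):
    """Drop function words, then count adjacent pairs joined with an underscore."""
    content = [lemma for lemma in lemmas if lemma not in stopwords]
    return Counter(f"{a}_{b}" for a, b in zip(content, content[1:]))

# Toy lemma sequence (invented, not the text of the article Gnomonique)
lemmas = ["le", "cadran", "solaire", "et", "le", "cadran", "horisontal",
          "et", "le", "cadran", "solaire"]
bigrams = content_bigrams(lemmas)
print(bigrams.most_common())
```

Because function words are removed before pairing, a bi-gram can join two words that were not adjacent in the running text.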
Why Features Matter
1. Similarity (TF-IDF) 2,830 features
2. Similarity (TF-IDF) 7,500 features
Similarity (TF-IDF) 19,000 bi-lemmas. Note again differences in identified articles and rank.
Similarity scores are much lower, reflecting the different distributions of n-grams. Similar
matches may be based on very small numbers of common features. PhiloMine can filter by a
threshold score.
Choosing Parameters and Features...
Feature and parameter selection have similar effects on other
kinds of machine learning algorithms, such as classifiers.
Open question: How do you choose features and parameters?
Do you simply rerun tasks until you find results you like?
What does this do to hypothesis testing in the humanities?
BLDR: what does finding 86% accuracy of nationality of author
really mean when we select among so many options?
Classifiers and Ontologies
Numerous kinds of classifiers:
Naive Bayes (MNB)
Support Vector Machines (SVM)
Decision Tree
Nearest Neighbor (kNN)
and many others.
Suitability to task:
SVM primarily binary classifier;
MNB fast but simple;
kNN slow, better on humanistic information spaces?
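To make the kNN option concrete, here is a minimal nearest-neighbor classifier over word counts: it finds the k most similar labeled documents by cosine similarity and lets them vote. The function names, the value of k, and the toy training data are assumptions for illustration, not the configuration used in the experiments reported here.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(tokens, labeled_docs, k=3):
    """Return (label, votes) pairs from the k nearest labeled documents."""
    query = Counter(tokens)
    ranked = sorted(
        labeled_docs,
        key=lambda item: cosine(query, Counter(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common()

# Toy labeled documents (hypothetical)
training = [
    ("monnoie argent or".split(), "Money"),
    ("monnoie medaille empereur".split(), "Numismatics"),
    ("monnoie argent valeur".split(), "Money"),
    ("loi juge tribunal".split(), "Jurisprudence"),
]
categories = knn_classify("monnoie argent empereur".split(), training)
print(categories)
```

Every query is compared against every labeled document, which is why kNN is slow on large collections. The vote also returns several candidate categories, not just one.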
Different Classifiers, Different Results
Classify Chapters of Montesquieu, De l’esprit des Lois using Encyclopédie
classifications, or ontology:
Chapter: “Opérations sur les monnoies du temps des empereurs.”
kNN Best category = Money
kNN All categories = Money, Numismatics, Roman History
MNB Best category = Jurisprudence
MNB All categories = Jurisprudence
Chapter: “Des moeurs relatives aux combats.”
kNN Best category = Ethics
kNN All categories = Ethics, History Of Chivalry, French Language
MNB Best category = Literature
MNB All categories = Literature, Grammar
Ontologies are historical artifacts
Previous comparison based on the ontology of the Encyclopédie.
Humanists know that ontologies (classification systems) are temporal,
cultural, domain-specific artifacts.
Ontologies encode perspectives, worldviews, and power relations.
“Classification systems in general [...] reflect the conflicting,
contradictory motives of the sociotechnical situations that gave rise
to them.”
-- Bowker and Star, Sorting Things Out
If ontologies are contingent, how do we choose between them?
Ephraim Chambers, Cyclopaedia, 1728
Système figuré des
connaissances humaines.
Encyclopédie, 1751
Dewey Classification, 1876
Generated Ontologies:
Graphing the relationship of
the Encyclopédie classes
using centroids.
Multiplication of Ontologies
As shown on Michael K.
Bergman's AI3 site:
http://www.mkbergman.com/?p=374
An unlimited number of
ontologies.
Which ones will machine
learning tools in the
humanities use?
Intertextuality and Directed Reading
So, if algorithms, features, parameters, classifications,
and ontologies are all contingent... Where does
that leave us?
We could: use machine learning and data mining tools
for “directed reading,” i.e., approaches that aid in
the discovery of intertextual relations over
thousands/millions of books...
Intertextuality and Directed Reading
• We are working on systems to propose intertextual
connections, linking related passages or citations
between documents.
• Humans will then follow these machine
generated/proposed links.
• This type of “directed reading” will have impact on
what gets consulted.
• But, what happens to texts that fall outside of the
results of ML?
PhiloLine: Sequence Alignment
Open source: http://code.google.com/p/text-pair/
Investigation of intertextual relationships begins with the identification
of related passages using “sequence alignment.”
Technique to identify regions of similarity shared by two strings or
sequences, known in computer science as the “longest common
subsequence” (LCS) problem.
Applications in many domains, including:
Bioinformatics: detection of similar DNA sequences;
Plagiarism detection in text and computer code;
Collation of texts or manuscript traditions, i.e., genetic criticism.
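The longest common subsequence problem has a textbook dynamic-programming solution, sketched below over word tokens. This is the generic algorithm, not PhiloLine's implementation. The two sample lines are lowercased, punctuation-stripped versions of the Shakespeare and Markham passages quoted later in this talk.

```python
def longest_common_subsequence(a, b):
    """Return the longest sequence of items appearing in both a and b, in order."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the subsequence itself
    result = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]

shakespeare = "graze on my lips and if those hills be dry".split()
markham = "graze on my lips and when those mounts are drie".split()
shared = longest_common_subsequence(shakespeare, markham)
print(" ".join(shared))
```

Exact matching treats "dry" and "drie" as different words, so spelling variation in early modern texts has to be handled separately, for example through normalization or more flexible matching parameters.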
PhiloLine: Sequence Alignment
Look for sequences of common words or n-grams;
Only use n-grams of content words, filter out function words;
Adjust parameters to allow for more flexible matching, e.g., related but not
identical passages.
“L'homme est né libre, et partout il est dans les fers. Tel se croit le maître des
autres, qui ne laisse pas d'être plus esclave qu'eux.”
trigram                  sequence   bytes
homme_libre_partout      208-213    5084-31
libre_partout_fers       211-218    5098-38
partout_fers_croit       213-221    5108-46
fers_croit_maitre        218-223    5132-33
croit_maitre_laisse      221-228    5149-42
maitre_laisse_esclave    223-233    5158-58
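The trigram column above can be approximately reproduced with a short sketch: lowercase the passage, strip accents, drop function words, and join each run of three remaining words. The stopword list is hypothetical, chosen only so that this one sentence yields the trigrams shown on the slide; PhiloLine's own list, and its sequence and byte offsets, are not reproduced. `shared_ngrams` illustrates the first matching step, finding n-grams two passages have in common.

```python
import re
import unicodedata

# Hypothetical function-word list, sized for this one example
STOPWORDS = {"l", "est", "ne", "et", "il", "dans", "les", "tel", "se", "le",
             "des", "autres", "qui", "pas", "d", "etre", "plus", "qu", "eux"}

def strip_accents(text):
    """Remove combining accent marks so that 'maître' becomes 'maitre'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def content_trigrams(text, stopwords=STOPWORDS):
    """Return underscore-joined trigrams of the non-function words in text."""
    tokens = re.findall(r"[a-z]+", strip_accents(text.lower()))
    content = [tok for tok in tokens if tok not in stopwords]
    return ["_".join(content[i:i + 3]) for i in range(len(content) - 2)]

def shared_ngrams(a, b):
    """Return the n-grams of a, in order, that also occur in b."""
    in_b = set(b)
    return [gram for gram in a if gram in in_b]

rousseau = ("L'homme est né libre, et partout il est dans les fers. "
            "Tel se croit le maître des autres, qui ne laisse pas d'être "
            "plus esclave qu'eux.")
trigrams = content_trigrams(rousseau)
# Invented paraphrase, used only to show a partial match
variant = "L'homme est libre et partout dans les fers"
common = shared_ngrams(content_trigrams(variant), trigrams)
print(trigrams)
print(common)
```

Filtering function words before building n-grams is what lets a reworded borrowing still match: the paraphrase shares two trigrams with the original even though its running text differs.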
PhiloLine: Sequence Alignment
Locke, John, [1783], Du gouvernement civil (GALE-ECCO):
Que fi le pouvtoir légiilatif a été donné par le plus grand nombre , à une personne ou à plufieurs,
teulement à vie, ou pour un tems autrement limité; quand ce tems-là est fini,. le pouvoir
souverain retourne à la fociété; & quand il y ef retourné de cette manière, la fociété en peut
disposer comme il lui plaît, & le remettre entre les mains de ceux qu'elle trouve bon, & ainfi
établir une nouvelle forme de gouvernement . CHAPITRE X. De l'étendue du Pouvoir législatif.
IL. PAR une communauté ou un état, il ne faut donc point entendre, ni une démocratie, ni
aucune autre forme pré- cife de gouvernement,t
Encyclopédie, “GOUVERNEMENT,” Jaucourt
Si le pouvoir législatif a été donné par un peuple à une personne, ou à plusieurs à vie, ou pour un
tems limité, quand ce tems - là est fini, le pouvoir souverain retourne à la société dont il
émane. Dès qu'il y est retourné, la societé en peut de nouveau disposer comme il lui plait, le
remettre entre les mains de ceux qu'elle trouve bon, de la maniere qu'elle juge à - propos, &
ainsi ériger une nouvelle forme de gouvernement . Que Puffendorff qualifie tant qu'il voudra
toutes les sortes de gouvernemens mixtes du nom d'irréguliers , la véritable régularité sera
toujours celle qui sera le plus conforme au bien des sociétés civiles.
PhiloLine: Sequence Alignment
She locks her lily fingers one in
one. “Fondling,” she saith,
“since I have hemmed thee
here Within the circuit of this
ivory pale, I'll be a park, and
thou shalt be my deer; Feed
where thou wilt, on mountain
or in dale: Graze on my lips;
and if those hills be dry, Stray
lower, where the pleasant
fountains lie . “Within this
limit is relief enough....
Shakespeare, Venus and Adonis
[1593]
Pre. Fondling, said he, since I haue
hem'd thee heere, VVithin the
circuit of this Iuory pale. Dra. I pray
you sir help vs to the speech of
your master. Pre. Ile be a parke,
and thou shalt be my Deere: He is
very busie in his study. Feed where
thou wilt, in mountaine or on dale.
Stay a while he will come out
anon. Graze on my lips, and when
those mounts are drie, Stray lower
where the pleasant fountaines lie .
Go thy way thou best booke in the
world. Ve. I pray you sir, what
booke doe you read?
Markham, The dumbe knight. [1608]
Distant vs. Directed Reading:
What do we lose? What do we gain?
Distant reading: where distance, let me repeat it, is a condition of
knowledge: it allows you to focus on units that are much smaller or
much larger than the text: devices, themes, tropes—or genres and
systems. And if, between the very small and the very large, the text
itself disappears, well, it is one of those cases when one can
justifiably say, Less is more. If we want to understand the system in
its entirety, we must accept losing something. We always pay a
price for theoretical knowledge: reality is infinitely rich; concepts
are abstract, are poor. But it’s precisely this ‘poverty’ that makes
it possible to handle them, and therefore to know. This is why less is
actually more.
Franco Moretti, “Conjectures on World Literature” (2000)
http://www.newleftreview.org/A2094
Conclusions...
Machine learning and data mining approaches will be necessary for
future humanities research and will “direct” researchers to
materials.
These techniques may not be best suited, however, for finding
“oddities,” exceptions, and other outliers that humanists love.
Your critique is central here.
Humanists understand the conditions of knowledge.
Digital humanities can thus bring both technical sophistication and
humanistic perspective to the critical analysis of machine
learning and data mining techniques.
