Transcript big noisy

Visualizing digital footprints of our complex life
Data mining
You don't have to be a rocket scientist to be a data scientist!
János Abonyi
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Papers, Files, Web documents, Scientific experiments, Database Systems
Knowledge Discovery (KDD) Process
Knowledge
Pattern Evaluation
Data Mining
Task-relevant Data
Data Warehouse
Selection
Data Cleaning
Data Integration
Databases
Problem => Hypothesis
√
?
Model Identification
Exploratory data analysis
Check the hypothesis
Generate hypothesis
Supervised learning
Unsupervised learning
Classification
Clustering
Frequent itemset mining, association rules
Anomaly detection
Regression
Recommender systems, collaborative filtering
Sentiment analysis
Sentiment analysis (opinion mining) refers to the use of natural language processing,
text analysis and computational linguistics to identify and
extract subjective information in source materials.
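As a rough illustration of extracting subjective information from text, here is a minimal lexicon-based sketch. The word list and the function names are hypothetical and chosen only for this example; practical sentiment analysis relies on much larger lexicons or trained models.

```python
import re

# Hypothetical mini-lexicon (assumption for illustration only).
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "poor": -1, "hate": -1}


def sentiment_score(text):
    """Sum the polarity of every known word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(token, 0) for token in tokens)


def sentiment_label(text):
    """Map the score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sentence in ["I love this great portal", "The data quality is bad"]:
        print(sentence, "->", sentiment_label(sentence))
```

A word-counting approach like this ignores negation and context, which is one reason the definition above mentions natural language processing and computational linguistics.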
Classifier Evaluation Metrics:
Accuracy, Error Rate, Sensitivity and Specificity
Confusion matrix (rows: actual class A, columns: predicted class P)
A\P     C     ¬C    Total
C       TP    FN    P
¬C      FP    TN    N
Total   P'    N'    All
Classifier Accuracy, or recognition rate:
percentage of test set tuples that are correctly classified
Accuracy = (TP + TN)/All
Error rate: 1 - accuracy, or
Error rate = (FP + FN)/All
Class Imbalance Problem:
One class may be rare, e.g. fraud, or HIV-positive
Significant majority of the negative class and minority of the positive class
Sensitivity: True Positive recognition rate
Sensitivity = TP/P
Specificity: True Negative recognition rate
Specificity = TN/N
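The four measures can be computed directly from the confusion-matrix counts. The sketch below uses made-up counts for an imbalanced test set (100 positives, 9900 negatives); the function name and the numbers are assumptions for illustration, not results from the talk.

```python
def evaluate(tp, fn, fp, tn):
    """Compute the slide's metrics from confusion-matrix counts."""
    p = tp + fn          # all actual positives (P)
    n = fp + tn          # all actual negatives (N)
    total = p + n        # All
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / p,
        "specificity": tn / n,
    }


# Hypothetical, strongly imbalanced test set.
metrics = evaluate(tp=30, fn=70, fp=100, tn=9800)

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.983 while the sensitivity is only 0.3, which is why accuracy alone can mislead when the positive class is rare.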
Time-series mining
Clustering
Classification
Rule discovery
Content based search

Outlier detection
Motifs
[Figure: example time series A, B and C illustrating these tasks; annotation: s = 0.5, c = 0.3]
Applications
What do people think about the EU?
Decision
What influences regional development?
How to measure the quality of life?
Operation
Predictive modeling
Early warning
Information
Link analysis
Segmentation
Tools
Anomaly detection
Data
Regression
Classification
Freq. itemset
Clustering
Time demand
Problem analysis
Data analysis
Collection of data
Data cleaning
Data mining
Reporting
Application
Feedback
[Bar chart: time share of each step; axis from 0% to 30%]
The major challenge for data scientists:
The Data-to-Knowledge (D2K) challenge
Big Data: Over 80% of our data is from text/natural
language/social media, unstructured, noisy, dynamic,
unreliable, …, but interconnected!
Keys from big data to big knowledge:
Structuring: transforming unstructured text into structured, typed, interconnected entities/relationships
Networking: take advantage of massive, structured connections
Mining/reasoning: effectively on massive, relatively structured, interconnected networks
D2K → D2N2K (Data to Network to Knowledge)
Construction and mining of typed, heterogeneous information networks
Teamwork – Big Data Workshop
Let’s see the details
Administrative datasets
EU Open Data Portal
http://www.europeandataportal.eu/
Single point of access to a growing range of data from the
institutions and other bodies of the European Union (EU).
Data are free for you to use and reuse for commercial or noncommercial purposes.
Data.gov
https://www.data.gov/
The home of the U.S. Government’s open data
World Bank
http://data.worldbank.org
World Bank Open Data: free and open access to data about
development in countries around the globe.
OECD DATA
https://data.oecd.org/
OECD data: charts, maps, tables and related publications
OEDA
http://openeventdata.org
The prime objective of the OEDA is to provide reliable, open
access, multi-sourced political event datasets that are
updated at least weekly, are transparent and have
documented source texts, and use one or more of the open
coding ontologies supported by the organization
EHPS
http://primary-sources.eui.eu
The purpose of EHPS is to provide an easily searchable index
of scholarly digital repositories that contain primary sources
for the history of Europe
ENGAGE
http://www.engagedata.eu/opendatasites
ENGAGE is a door for researchers that leads them to the
world of Open Government Data. By using the ENGAGE
platform, researchers and citizens will be able to submit,
acquire, search and visualize diverse, distributed and derived
Public sector datasets from all the countries of the European
Union.
EUROSTAT
http://ec.europa.eu/eurostat/data/database
European Statistics
http://atlas.media.mit.edu/
https://public.tableau.com
http://datausa.io
Linked open data
PREFIX db: <http://dbpedia.org/resource/>
PREFIX onto: <http://dbpedia.org/ontology/>
SELECT *
WHERE { ?s onto:birthPlace db:Kőszeg }
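A query like the one above could be sent to a SPARQL endpoint from a script. The sketch below assumes the public DBpedia endpoint at https://dbpedia.org/sparql and the `format` request parameter it accepts; the helper names are hypothetical, and `run_query` needs network access.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

QUERY = """
PREFIX db: <http://dbpedia.org/resource/>
PREFIX onto: <http://dbpedia.org/ontology/>
SELECT *
WHERE { ?s onto:birthPlace db:Kőszeg }
"""


def build_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request asking for JSON results."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    return f"{endpoint}?{params}"


def run_query(query):
    """Send the query and return the values bound to ?s (requires network)."""
    with urllib.request.urlopen(build_url(query), timeout=30) as response:
        data = json.load(response)
    return [binding["s"]["value"] for binding in data["results"]["bindings"]]


if __name__ == "__main__":
    for resource in run_query(QUERY):
        print(resource)
```

Each returned value is the URI of a resource whose birthplace is recorded as Kőszeg.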
http://iwb.fluidops.com/
http://pantheon.media.mit.edu/
RSS (Rich Site Summary) uses a family of standard web feed formats to publish
frequently updated information: blog entries, news headlines, audio, video
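Because an RSS feed is plain XML, its entries can be read with a standard XML parser. The sketch below parses a small, made-up RSS 2.0 document; the feed content, URLs and function name are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example news</title>
    <link>http://example.org/</link>
    <item>
      <title>First headline</title>
      <link>http://example.org/1</link>
      <pubDate>Mon, 27 Jun 2016 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.org/2</link>
      <pubDate>Tue, 28 Jun 2016 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return title, link and publication date of every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_FEED):
        print(entry["pubDate"], entry["title"])
```

The resulting list of dictionaries can then be loaded into a table or plotted on a map.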
2016_BD_Workshop\Demo\to CartoDB\Events-1
https://iask.cartodb.com/viz/3b678bfe-3c68-11e6-a3eb-0e3ff518bd15/public_map
Network of towns in Wikipedia
tables.googlelabs.com
http://analysis.gdeltproject.org/
Looking across the nearly 200 million articles from across the entire world in 65
languages monitored by GDELT in 2015, we wanted to explore geographic
correlation. The map below draws a line between every pair of locations
mentioned together in the same article at least 100 times across the entire 200
million article archive
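The rule described above (one line per pair of locations mentioned together in at least 100 articles) amounts to counting co-occurring pairs and applying a threshold. The sketch below is not GDELT's own code; the function name, the toy articles and the lowered threshold are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(articles, min_count=100):
    """Count location pairs per article and keep pairs seen at least min_count times."""
    counts = Counter()
    for locations in articles:
        # Each unordered pair is counted once per article.
        for pair in combinations(sorted(set(locations)), 2):
            counts[pair] += 1
    return {pair: count for pair, count in counts.items() if count >= min_count}


# Toy input: each inner list holds the locations mentioned in one article.
articles = [
    ["Budapest", "Vienna"],
    ["Vienna", "Budapest", "Berlin"],
    ["Berlin", "Paris"],
]
edges = cooccurrence_edges(articles, min_count=2)

if __name__ == "__main__":
    print(edges)
```

Every surviving pair corresponds to one line on the map, and the count can be used as the line weight.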
http://analysis.gdeltproject.org/module-gkg-tonetimeline.html
Application programming interface (API)
is a set of routine definitions, protocols, and tools
for building software and applications.
JSON (JavaScript Object Notation) is an open-standard format
that uses human-readable text to transmit data objects
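A short sketch of what "human-readable text to transmit data objects" means in practice: a data object is serialized to JSON text and parsed back. The record below is a hypothetical example, not data from any of the portals listed earlier.

```python
import json

# Hypothetical record, standing in for what an API might return.
record = {
    "dataset": "example",
    "rows": 3,
    "open": True,
    "tags": ["regional", "development"],
}

# Serialize the object to human-readable JSON text ...
text = json.dumps(record, indent=2)

# ... and parse the text back into an equivalent object.
restored = json.loads(text)

if __name__ == "__main__":
    print(text)
    print(restored == record)
```

Many web APIs exchange data in this form, so the same two calls cover most of the conversion work on the client side.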
OpenRefine
Extracting structured information from web forums
Nice reports …
Text mining
[Figure terms: Development, Sustainable, Country, Nature, Global, Goal, Financial, Human, World, Life, Social]
VOSviewer
C:\Users\János\Dropbox (Bigdata)\Bigdata Team Folder\Abonyi\Research\sciNET\VOS_viewer
VOSviewer - Map of Social Sciences
Since 1976, only 45 publications by Hungarian scientists have been related to
"sustainable development" (according to abstracts in the Scopus database)
What is in the books?
https://books.google.com/ngrams/graph?content=migration%2CEU&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cmigration%3B%2Cc0%3B.t1%3B%2CEU%3B%2Cc0
https://www.google.hu/trends/
https://www.google.hu/trends/explore#q=big%20data%2C%20%2Fm%2F06n6p%2C%20Migration&cmpt=q&tz=Etc%2FGMT-2
Thank you …
[email protected]