Data Inglorious - Tech Comfort LLC

Data Inglorious
Atlas: “All this data sure is heavy.”
Data: “Indeed, may I suggest moving it to the cloud.”
database defined
• A database is a collection of data, which is
organized into files called tables.
• These tables provide a systematic way of
accessing, managing, and updating data.
• A relational database is one that contains multiple
tables of data that relate to each other through
special key fields.
• Relational databases are far more flexible (though
harder to design and maintain) than what are
known as flat file databases, which contain a single
table of data.
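As a minimal sketch of "accessing, managing, and updating" data in a table, the following uses Python's built-in sqlite3 module with an in-memory database. The table name players and its columns are hypothetical, chosen only for illustration; they do not come from the slides.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# One table (hypothetical name and columns) with a key field.
conn.execute(
    "CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)"
)

# Managing: add rows.
conn.executemany(
    "INSERT INTO players (name, score) VALUES (?, ?)",
    [("Atlas", 10), ("Data", 25)],
)

# Updating: change one row in place.
conn.execute("UPDATE players SET score = score + 5 WHERE name = ?", ("Atlas",))

# Accessing: read the rows back in a defined order.
rows = conn.execute("SELECT name, score FROM players ORDER BY name").fetchall()
print(rows)
```

The same three operations (insert, update, select) apply to any relational system; only the connection step differs.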
overview, the payload
• Oracle Internet Directory (OID)
• Zynga Games/Farmville
• Facebook
• bioinformatics
• Calmail
ex. oracle OID
• Oracle Internet Directory: 400,000 operations per
second on a 500 million user database
ex. zynga games
• 65 million players a day, millions of web browsers
open, millions of farms (Farmville game), millions of
frontiers, millions of objects bought and sold, all
recorded in a database
• 500,000 operations-per-second database behind
Farmville
• http://www.readwriteweb.com/cloud/2010/08/membase-the-database-powering.php
ex. facebook
• 60,000 servers
• 1,800 MySQL servers
• 400 million active users
• 200 million a day
• 50 million operations per second
ex. bioinformatics
• DNA sequence data = prime
candidate for study with database
systems
• Homologous strings
• Nucleobases: Adenine, Guanine,
Cytosine, Thymine
• Roughly 3 billion base pairs in the human
genome, expressed as a string of
A, G, C, and T
• Human Genome Project: roughly 3 billion
letters of the human genome
• Sanger Institute: 1 billion on MySQL
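To make the "string of A, G, C, and T" idea concrete, here is a small sketch that treats DNA as plain text: it counts bases and compares two equal-length sequences position by position. The sequences and function names are invented for illustration, and simple percent identity is only a crude stand-in for homology; real tools align sequences before comparing them.

```python
from collections import Counter

VALID_BASES = set("AGCT")


def base_counts(sequence: str) -> Counter:
    """Count each base, rejecting characters that are not A, G, C, or T."""
    seq = sequence.upper()
    unknown = set(seq) - VALID_BASES
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    return Counter(seq)


def percent_identity(a: str, b: str) -> float:
    """Share of positions where two equal-length sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


# Made-up example sequences that differ at one position.
seq_a = "AGCTTAGC"
seq_b = "AGCTAAGC"

counts = base_counts(seq_a)
identity = percent_identity(seq_a, seq_b)
print(dict(counts), identity)
```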
ex. calmail
• Calmail: 4 million e-mails offered a day, 1 million
served; MySQL backend that just failed
flat file v. relational
• Imagine the needs of two small companies that take customer orders for their products. Company A uses a flat
file database with a single table named orders to record orders they receive, while Company B uses a relational
database with two tables: orders and customers.
• When a customer places an order with Company A, a new record (or row) in the table orders is created. Because
Company A has only one table of data, all the information pertaining to that order must be put into a single
record. This means that the customer's general information, such as name and address, is stored in the same
record as the order information, such as product description, quantity, and price. If customers place more than
one order, their general information will need to be re-entered and thus duplicated for each order they place.
• Whenever there is duplicate data, as in the case above, many inconsistencies may arise when users try to query
the database. Additionally, a customer's change of address would require the database manager to find all
records in orders that the customer placed, and change the address data for each one.
• Company B is much better off with its relational database. Each of its customers has one and only one record of
general information stored in the table customers. Each customer's record is identified by a unique customer code
which will serve as the relational key. When a customer orders from Company B, the record in orders need
contain only a reference to the customer's code, because all of the customer's general information is already
stored in customers.
• This approach to entering data solves the problems of duplicate data and making changes to customer
information. The database manager need change only one record in customers if someone changes addresses.
• Source: Indiana University Knowledge Base, http://kb.iu.edu/data/ahrp.html
(document ahrp in domain all; last modified on April 24, 2006)
flat file v. relational
• Single table (flat file) v multiple tables (relational)
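The Company A and Company B comparison can be sketched with Python's built-in sqlite3 module. The table names orders and customers follow the slides; flat_orders, the column names, and the sample customer are assumptions made for this illustration. The point to watch is how many rows an address change has to touch in each design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Company A: one flat table, customer details repeated on every order.
conn.execute(
    "CREATE TABLE flat_orders (order_id INTEGER PRIMARY KEY, "
    "customer_name TEXT, address TEXT, product TEXT, quantity INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO flat_orders (customer_name, address, product, quantity, price) "
    "VALUES (?, ?, ?, ?, ?)",
    [("Ada", "1 Old Road", "widget", 2, 9.5),
     ("Ada", "1 Old Road", "gadget", 1, 20.0)],
)
# An address change must be applied to every order row for that customer.
flat_changed = conn.execute(
    "UPDATE flat_orders SET address = ? WHERE customer_name = ?",
    ("2 New Street", "Ada"),
).rowcount

# Company B: customers stored once, orders refer to them by customer code.
conn.execute(
    "CREATE TABLE customers (customer_code INTEGER PRIMARY KEY, name TEXT, address TEXT)"
)
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer_code INTEGER REFERENCES customers(customer_code), "
    "product TEXT, quantity INTEGER, price REAL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada', '1 Old Road')")
conn.executemany(
    "INSERT INTO orders (customer_code, product, quantity, price) VALUES (?, ?, ?, ?)",
    [(1, "widget", 2, 9.5), (1, "gadget", 1, 20.0)],
)
# The same address change touches exactly one row.
relational_changed = conn.execute(
    "UPDATE customers SET address = ? WHERE customer_code = ?",
    ("2 New Street", 1),
).rowcount

# A join through the key field shows every order with the new address.
joined = conn.execute(
    "SELECT o.product, c.address FROM orders AS o "
    "JOIN customers AS c ON c.customer_code = o.customer_code "
    "ORDER BY o.order_id"
).fetchall()
print(flat_changed, relational_changed, joined)
```

With only two orders the difference is small, but the flat design's update cost grows with the number of orders per customer, while the relational design's stays at one row.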
web Connection
• Example: Plone Content Management System
connection to a MySQL database
go graphic, phpMyAdmin
• A graphical interface tool for working with MySQL
phpMyAdmin
• GSPP and phpMyAdmin
• localhost
other database systems
• Hadoop: distributed processing of large data sets
• http://code.zynga.com/2011/06/deciding-how-to-store-billions-of-rows-per-day/
• Membase: new for games and other apps
• http://www.readwriteweb.com/cloud/2010/08/membase-the-database-powering.php
• CouchDB: no schema
• http://couchdb.apache.org/docs/intro.html
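To illustrate what "no schema" means in practice, the sketch below stores JSON documents of different shapes side by side in one collection. It is not CouchDB's API: a plain Python dictionary stands in for the document store, and the put and get helpers and the sample documents are hypothetical.

```python
import json

# A dict of JSON strings stands in for a document store (not CouchDB itself).
documents = {}


def put(doc_id: str, doc: dict) -> None:
    """Store a document as JSON text under its id."""
    documents[doc_id] = json.dumps(doc)


def get(doc_id: str) -> dict:
    """Load a document back into a Python dict."""
    return json.loads(documents[doc_id])


# Two documents with different fields; no table definition is required first.
put("farm-1", {"type": "farm", "owner": "Atlas", "crops": ["corn", "wheat"]})
put("player-1", {"type": "player", "name": "Data", "level": 7})

# Queries must inspect each document, since no shared schema is guaranteed.
farm_owners = sorted(
    get(doc_id)["owner"] for doc_id in documents if get(doc_id)["type"] == "farm"
)
print(farm_owners)
```

The trade-off is that structure checks move from the database into the application code that reads the documents.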