
WebInfoMall: the Chinese Web Archive
How we got started and how it is now
Huang Lianen and Li Xiaoming
Peking University, China
Digital Archive Workshop
August 27, 2007, Xian, China
Outline

Motivation developed in 2001
- In 2001, I was not able to give an answer when someone asked me what had been on the Chinese web in 1996.
- In 2100, I'd like to be able to answer concretely if someone asks me what was on the Chinese web in 2001.

Archiving technology
- For long-term web crawling and storage, what technology should be used, especially in a university lab environment?

Exhibition of the archive
- How do we show the archive to society?
Institute of Network Computing and
On the elapsing nature of Web data


Li Xiaoming, “On the estimation of the number of previous Chinese Web pages”, Journal of Peking University, Vol. 39, No. 3, May 2003, 394-398.
As a by-product, we also obtained the result that the time for 50% of current web pages to disappear is about 0.99 year.
Observing the elapsing nature, can we archive them before they are gone?
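To get a feel for what a half-life of about 0.99 year means, here is a rough sketch. It assumes pages disappear at a constant rate (exponential decay), which the slide does not state; only the 0.99-year figure comes from the slide, and the function name is made up for illustration.

```python
# Slide figure: about 50% of current web pages disappear within 0.99 year.
HALF_LIFE_YEARS = 0.99


def surviving_fraction(years: float, half_life: float = HALF_LIFE_YEARS) -> float:
    """Fraction of today's pages still online after `years`.

    Assumption (not from the slides): disappearance follows exponential decay.
    """
    return 0.5 ** (years / half_life)


for years in (1, 2, 5):
    share = surviving_fraction(years)
    print(f"after {years} year(s): {share:.1%} of today's pages remain")
```

Under this assumption, only a small share of today's pages would still be reachable after a few years, which is the argument for crawling continuously rather than occasionally.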
We have some advantage
With a search engine, 50% is done!
The system work started in 2001
The progress and current status
- The crawl started in 2001, and the first batch of data was put online on Jan 18, 2002.
- As of today, the repository holds over 2.5 billion distinct Chinese web pages, more precisely, pages crawled from mainland China's web.
- About 1 million pages are added every day.
- Initially we used tapes for storage, but later changed to hard disks.
- Total online data volume (compressed) ≈ 30TB, with an offline backup.
- Spring 2002: “historical browsing” was provided; summer 2006: beta test of “backward browsing”.

Example: the InfoMall interface
Example: entering www.sina.com.cn
Example: Sina on 2002.1.18
Headquarters of Bin Ladin was bombed.
Links preserved
The first air strike of the new year: the American Air Force bombed the headquarters of Bin Ladin.
Links still preserved
2002.10.8
2003.9.2
2004.5.28
Featured collections: SARS
Featured collections: the first manned space vehicle
We ask three questions:
- What's the use?
  - Preserving historical information before it's lost
  - Implying great opportunities for deep mining
  - Providing access to previous information much more conveniently than libraries, even if they have kept it
- Can we do it? (or at least get a pretty good start)
  - “we”: a university lab
- How do we do it?
Can we do it? (resource requirements)

“Hard” resources
- Crawler system: 4 computers at $5,000 each
- Storage system: about 50 million pages per 1TB, which amounts to $4,000. If you need a backup, double the investment.
- Access web server: $4,000
- Space (not big, but reliable) to house these machines
- High-speed network connection: ? per month

“Soft” resources
- Permission for crawling and keeping
- Staff to handle the daily routine matters
- Persistent enthusiasm for this undertaking
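The storage figures above can be turned into a back-of-the-envelope estimate. The sketch below uses only the slide's numbers (about 50 million pages per 1TB, about $4,000 per 1TB, double for a backup) together with the roughly 1 million new pages per day mentioned earlier; the function name is hypothetical and the prices are the 2007 slide figures, not current ones.

```python
PAGES_PER_TB = 50_000_000  # slide figure: about 50 million pages per 1TB
COST_PER_TB_USD = 4_000    # slide figure: 1TB of storage amounts to $4,000


def storage_estimate(pages: int, with_backup: bool = False) -> tuple[float, float]:
    """Return (terabytes needed, storage cost in USD) for `pages` pages."""
    terabytes = pages / PAGES_PER_TB
    cost = terabytes * COST_PER_TB_USD
    if with_backup:
        cost *= 2  # "If you need a backup, double the investment."
    return terabytes, cost


# One year of crawling at about 1 million new pages per day.
pages_per_year = 1_000_000 * 365
terabytes, cost = storage_estimate(pages_per_year, with_backup=True)
print(f"{terabytes:.1f} TB per year, about ${cost:,.0f} including backup")
```

The point of the estimate is the order of magnitude: at these figures, a year of incremental crawling costs about as much in storage as a few of the crawler machines.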
How do we do it?
- Incremental crawling
  - A scheduled daily operation collects about one to two million new pages a day; fingerprints are compared with those of previous pages
- Data storage and incorporation
  - Once every few weeks, after enough data has been collected
- Accessibility
  - Wayback Machine style
  - Featured exhibitions
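The fingerprint comparison in the incremental crawl could look roughly like the sketch below. The slides do not say which fingerprint function is used or whether comparison is per URL or across the whole repository; this sketch assumes a SHA-256 hash of the page content compared against everything stored so far, and the class and method names are invented for illustration.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content fingerprint (assumption: SHA-256; the slides name no algorithm)."""
    return hashlib.sha256(content).hexdigest()


class IncrementalArchive:
    """Stores a page only if its fingerprint has not been seen before."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.records: list[tuple[str, str, str]] = []  # (url, crawl date, fingerprint)

    def ingest(self, url: str, crawl_date: str, content: bytes) -> bool:
        """Return True if the page was new and stored, False if it was a duplicate."""
        fp = fingerprint(content)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        self.records.append((url, crawl_date, fp))
        return True


archive = IncrementalArchive()
day1 = [("http://example.com/a", b"<html>news A</html>"),
        ("http://example.com/b", b"<html>news B</html>")]
day2 = [("http://example.com/a", b"<html>news A</html>"),           # unchanged
        ("http://example.com/b", b"<html>news B, updated</html>")]  # changed
for date, batch in (("2002-01-18", day1), ("2002-01-19", day2)):
    stored = sum(archive.ingest(url, date, body) for url, body in batch)
    print(date, "new pages stored:", stored)
```

Keeping the crawl date with each stored version is what would make a Wayback Machine style lookup by URL and date possible later.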
WebInfoMall: hierarchical module data organization

- Assurance of scalability and dynamic reconfigurability: record : file : batch : disk : node : system
- Convenient for coping with changes at all levels
- Matching logical data organization with physical device structure as closely as possible
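One way to picture the record : file : batch : disk : node : system hierarchy is a logical address plus a placement table that says where each batch currently lives. This is only an illustrative sketch; the slides do not describe the actual data structures, and every name below is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordAddress:
    """Logical address of one record: batch, then file, then record number."""
    batch: str
    file: str
    record: int


class System:
    """Top level of the hierarchy: maps each batch to a (node, disk) pair."""

    def __init__(self) -> None:
        self._batch_location: dict[str, tuple[str, str]] = {}

    def place(self, batch: str, node: str, disk: str) -> None:
        """Assign a batch to a disk on a node, or move it if already placed."""
        self._batch_location[batch] = (node, disk)

    def resolve(self, addr: RecordAddress) -> str:
        """Translate a logical address into a physical location string."""
        node, disk = self._batch_location[addr.batch]
        return f"{node}/{disk}/{addr.batch}/{addr.file}#{addr.record}"


system = System()
system.place("batch-2002-01", node="node1", disk="disk0")
addr = RecordAddress(batch="batch-2002-01", file="file-0007", record=42)
print(system.resolve(addr))

# Reconfiguration: moving the batch changes one table entry, not the address.
system.place("batch-2002-01", node="node3", disk="disk2")
print(system.resolve(addr))
```

In this sketch a change at the disk or node level touches only the placement table, which is one plausible reading of "coping with changes at all levels".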
The architecture
The operations under the hood
Comparison
- A survey done by the National Library of China
- Web InfoMall is the only large-scale web archive in China, operated in a university lab!
In the flattened world, “small can act big!”
Resource sharing

- We have published the data storage format
- and provide WebInfoMall data to the research community for free.
  - The beneficiary research units include Peking University, Tsinghua University, Chinese Academy of Sciences, Shanghai Jiaotong University, Renmin University of China, Harbin Institute of Technology, ....
- In particular, we built the largest Chinese Web Test collection, with 200GB of compressed web pages (CWT200g), for evaluating Chinese web information retrieval technologies
Summary
- WebInfoMall, http://www.infomall.cn, has been the Chinese web archive since 2001, with over 2.5 billion pages in its repository as of 2007.
- Straightforward technology has been used for building WebInfoMall
  - Linux box + Berkeley DB + hierarchical module data organization
- We are looking into different ways to access the data, to get more value than just information preservation and history browsing
Thanks for your attention
