last update: 2014
This page summarizes resources of interest for research. It is not an exhaustive list, only a set of references to start from. From this page you can also browse to further information on my software and research.
Specific data sets
Data set | Owner | Context | Description |
---|---|---|---|
The SEER data [+] | U.S. National Cancer Institute | N/A | Diagnosis |
ICML 2009 dataset (2009) [+] | UCSD Dept. of Computer Science and Engineering | Web mining - Usage | Detection of malicious URLs (spam, phishing, exploits, and so on) |
The Public Terabyte Dataset Project (2010) [+] | Bixo Labs / Amazon | Web Mining - Content & Usage | The data come from a crawl of 50-200M pages from the top 100K (by traffic) English-language domains. |
The Internet Traffic Database (2008) [+] | Lawrence Berkeley National Laboratory / ACM SIGCOMM | Web mining / Usage | HTTP requests on different servers |
Web->KB Project [+] | CMU World Wide Knowledge Base | Web Mining - Content / MRDM | It contains: (1) A data set consisting of classified Web pages. (2) A relational data set describing both pages and hyperlinks. (3) A subset of the 4 Universities dataset containing web pages and hyperlink data. (4) The 20 newsgroups dataset. (5) The 7sectors dataset. |
Syskill and Webert Web Page Ratings [+] | UCI | User Ratings | To predict user ratings on web pages |
KDD Cup 2005 [+] | ACM SIGKDD | N/A | Query categorization (800,000 queries into 67 predefined categories) |
KDD Cup 2007 [+] | ACM SIGKDD | User Ratings | On predicting aspects of movie rating behavior. |
MovieLens Data Sets [+] | GroupLens Research | User Ratings | They currently have three datasets available: (1) 100,000 ratings for 1682 movies by 943 users (2) 1 million ratings for 3900 movies by 6040 users (3) 10 million ratings and 100,000 tags for 10681 movies by 71567 users |
Anonymous Ratings from the Jester Online Joke Recommender System [+] | Jester 4.0 | User Ratings | Dataset 1: Over 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users, collected between April 1999 and May 2003. Dataset 2: Over 1.7 million continuous ratings (-10.00 to +10.00) of 150 jokes from 63,974 users, collected between November 2006 and May 2009. |
Book-Crossing Dataset [+] | IIF – Institut für Informatik – Freiburg | User Ratings | Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books. |
Last.fm - Music Recommendation Datasets for Research (2010) [+] | Óscar Celma, UOC | User Ratings | This dataset contains [user, artist, plays] tuples (for ~360,000 users) collected from Last.fm API. |
Reuters 21578 [+] | David Lewis | Text mining | Reuters 21578 |
Web Spam Detection [+] | Yahoo! Research Barcelona | Spam Detection | WEBSPAM-UK2007, WEBSPAM-UK2006, and older collections |
The Enron dataset [+] | CMU.edu | Real e-mail content | It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. |
CiteULike Available Datasets [+] | CiteULike | N/A | The file constitutes an anonymous dump of who posted what and when the posting took place. There is no data in this file which is not already available publicly through the web site, so there are no privacy implications for making it available. |
Wikipedia Database Complete Dump [+] | Wikipedia | Text mining | The latest complete dump of the English-language Wikipedia |
The EUR-Lex datasets [+] | TU Darmstadt | Text mining | The EUR-Lex text collection is a collection of documents about European Union law. The most important categorization is provided by the EUROVOC descriptors, which form a topic hierarchy with almost 4000 categories regarding different aspects of European law. |
The 4-Universities Dataset [+] | CMU World Wide Knowledge Base | Web Mining | This data set contains WWW pages collected from computer science departments of various universities in January 1997 by the Web->KB project of the CMU text learning group. The 8,282 pages were manually classified into 7 categories. |
The 4-Universities Dataset (Relational version) [+] | CMU World Wide Knowledge Base | Web Mining - ILP | The data consists of relations suitable for providing to FOIL, as well as the complete text of all the web pages and also of anchors and the text surrounding anchors. |