Prof. José Raúl Romero - Department of Computer Science and Numerical Analysis

Dr. José Raúl Romero Salguero
www.jrromero.net



DATASETS

The SEER data

Owner: U.S. National Cancer Institute
Context: N/A
Description: Cancer diagnosis data

ICML 2009 dataset (2009)

Owner: UCSD Dept. of Computer Science and Engineering
Context: Web Mining - Usage
Description: Detection of malicious URLs (spam, phishing, exploits, and so on)

The Public Terabyte Dataset Project (2010)

Owner: Bixo Labs / Amazon
Context: Web Mining - Content & Usage
Description: A crawl of 50-200M pages from the top 100K English-language domains (ranked by traffic).

The Internet Traffic Database (2008)

Owner: Lawrence Berkeley National Laboratory / ACM SIGCOMM
Context: Web Mining - Usage
Description: HTTP requests to different servers

Web->KB Project

Owner: CMU World Wide Knowledge Base
Context: Web Mining - Content / MRDM
Description: Contains: (1) a data set of classified web pages; (2) a relational data set describing both pages and hyperlinks; (3) a subset of the 4 Universities dataset containing web pages and hyperlink data; (4) the 20 newsgroups dataset; (5) the 7sectors dataset.

Syskill and Webert Web Page Ratings

Owner: UCI
Context: User Ratings
Description: Prediction of user ratings of web pages

KDD Cup 2005

Owner: ACM SIGKDD
Context: N/A
Description: Query categorization (800,000 queries into 67 predefined categories)

KDD Cup 2007

Owner: ACM SIGKDD
Context: User Ratings
Description: Prediction of aspects of movie rating behavior.

MovieLens Data Sets

Owner: GroupLens Research
Context: User Ratings
Description: Three datasets are currently available: (1) 100,000 ratings for 1,682 movies by 943 users; (2) 1 million ratings for 3,900 movies by 6,040 users; (3) 10 million ratings and 100,000 tags for 10,681 movies by 71,567 users.

Anonymous Ratings from the Jester Online Joke Recommender System

Owner: Jester 4.0
Context: User Ratings
Description: Dataset 1: over 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users, collected between April 1999 and May 2003. Dataset 2: over 1.7 million continuous ratings (-10.00 to +10.00) of 150 jokes from 63,974 users, collected between November 2006 and May 2009.

Book-Crossing Dataset

Owner: IIF – Institut für Informatik – Freiburg
Context: User Ratings
Description: Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

Last.fm - Music Recommendation Datasets for Research (2010)

Owner: Óscar Celma, UOC
Context: User Ratings
Description: This dataset contains [user, artist, plays] tuples (for ~360,000 users) collected from the Last.fm API.

Reuters 21578

Owner: David Lewis
Context: Text mining
Description: The Reuters-21578 text categorization test collection

Web Spam Detection

Owner: Yahoo! Research Barcelona
Context: Spam Detection
Description: WEBSPAM-UK2007, WEBSPAM-UK2006, and older collections

The Enron dataset

Owner: CMU.edu
Context: Real e-mail content
Description: It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages.

CiteULike Available Datasets

Owner: CiteULike
Context: N/A
Description: The file constitutes an anonymous dump of who posted what and when the posting took place. There is no data in this file which is not already available publicly through the web site, so there are no privacy implications for making it available.

Wikipedia Database Complete Dump

Owner: Wikipedia
Context: Text mining
Description: The latest complete dump of the English-language Wikipedia

The EUR-Lex datasets

Owner: TU Darmstadt
Context: Text mining
Description: The EUR-Lex text collection is a collection of documents about European Union law. The most important categorization is provided by the EUROVOC descriptors, which form a topic hierarchy with almost 4000 categories regarding different aspects of European law.

The 4-Universities Dataset

Owner: CMU World Wide Knowledge Base
Context: Web Mining
Description: This data set contains web pages collected from computer science departments of various universities in January 1997 by the Web->KB project of the CMU text learning group. The 8,282 pages were manually classified into 7 categories.

The 4-Universities Dataset (Relational version)

Owner: CMU World Wide Knowledge Base
Context: Web Mining - ILP
Description: The data consists of relations suitable as input to FOIL, together with the complete text of all the web pages, the anchors, and the text surrounding the anchors.


REPOSITORIES

Datasets for Data Mining, Analytics and Knowledge Discovery

Owner: KDnuggets

UC Irvine Machine Learning Repository

Owner: UCI KDD Archive

AWS Developer Community

Owner: Amazon Web Services

Intl. Network for Social Network Analysis

Owner: INSNA

Datasets for training

Owner: UCLA Statistics datasets

Trust network datasets (social network datasets)

Owner: TrustLet.org

Data for Research (by categories)

Owner: Daniel Lemire

Information Retrieval Resources (Niraj)

Owner: Niraj Kumar

IR Multilingual Resources at UniNE (Stemming Dictionaries)

Owner: Université de Neuchâtel

Public Databases @ Bixo Labs

Owner: Bixo Labs

Frequent Itemset Mining Dataset Repository

Owner: FIMI workshops (2003/04)

The LUCS-KDD Discretised/Normalised (V2) ARM and CARM Data Library

Owner: Frans Coenen

Regression datasets

Owner: Luis Torgo (Univ. Porto)

PMML Sample Models

Owner: Data Mining Group

EDM datasets

Owner: PSLC DataShop

Data Mining and Exploration (for students)

Owner: The University of Edinburgh

SWEO Community Project: Linking Open Data on the Semantic Web

Owner: W3C

The Text REtrieval Conference (TREC) datasets

Owner: US National Institute of Standards and Technology

Google Labs Public Data Explorer

Owner: Google

The KEEL dataset repository

Owner: KEEL Spanish Research Project


BLOGS

MobBlog

Description: Research: Trust, Reputation, Recommendations and Mobility

FlowingData

Description: 6 Influential Datasets that Changed the Way We Think

The DataWrangling Blog

Description: More than 400 dataset bookmarks


RESOURCES & SOFTWARE

(In house) test set generator v3.2

Owner: Frans Coenen (Univ. Liverpool)
Description: The data generator is intended to produce data sets for use in the testing of Association Rule Mining (ARM) algorithms, but may very well have other uses. Written in Java.

Software from KARYPIS Lab

Owner: University of Minnesota
Description: Software on partitioning, clustering, information retrieval, etc.

Orange

Owner: BioLab
Description: Open source data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting. Components for machine learning. Extensions for bioinformatics and text mining. Packed with features for data analytics.

KNIME

Owner: Knime.com AG
Description: KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform. From day one, KNIME has been developed using rigorous software engineering practices and is used by professionals in both industry and academia in over 60 countries.

RapidMiner

Owner: Rapid-I
Description: Available as a stand-alone application for data analysis and as a data mining engine for integration into your own products. Data integration, analytical ETL, data analysis, and reporting in a single suite. Powerful but intuitive graphical user interface for the design of analysis processes. Repositories for process, data, and metadata handling.

KEEL

Owner: Several Spanish universities (Spanish National Projects TIC2002-04036-C05, TIN2005-08386-C05 and TIN2008-06681-C06)
Description: KEEL is an open source (GPLv3) Java software tool to assess evolutionary algorithms for data mining problems including regression, classification, clustering, pattern mining and so on. It contains a large collection of classical knowledge extraction algorithms, preprocessing techniques (training set selection, feature selection, discretization, imputation methods for missing values, etc.), computational intelligence based learning algorithms, including evolutionary rule learning algorithms based on different approaches (Pittsburgh, Michigan, IRL, ...), and hybrid models such as genetic fuzzy systems, evolutionary neural networks, etc. It supports a complete analysis of any learning model in comparison to existing ones, including a statistical test module for comparison.

Weka

Owner: University of Waikato
Description: Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

Frequent Pattern Mining Implementations (C++)

Owner: Bart Goethals
Description: Implementations of several ARM algorithms: Apriori, NDI, Eclat, FP-Growth, DIC, etc.

Frequent Itemset Mining Implementations Repository

Owner: FIMI
Description: Implementations of several algorithms for frequent itemset mining
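Several entries above point to frequent itemset mining implementations such as Apriori. As a rough illustration only, and not the code from any of the listed repositories, the following Python sketch shows the level-wise Apriori idea: count single items, then repeatedly join frequent (k-1)-itemsets into k-item candidates, prune candidates that have an infrequent subset, and count the survivors. The function name `apriori`, its parameters, and the toy transactions are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        candidates = set()
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count step: one pass over the transactions per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
    return frequent


# Hypothetical toy data, not taken from any dataset listed on this page.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(transactions, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

This version favors readability over speed; the C++ implementations listed above use specialised data structures and are the better choice for the large benchmark datasets.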
