Prof. Dr. Jose Raul Romero Salguero

Last update: 2014

This page summarizes resources of interest for research. It is not intended to be an exhaustive list, but a set of references to start from. From this page you can also browse to further information on my software and research.

Specific data sets

Data set	Owner	Context	Description
The SEER data [+]	U.S. National Cancer Institute	N/A	Cancer diagnosis data
ICML 2009 dataset (2009) [+]	UCSD Dept. of Computer Science and Engineering	Web Mining - Usage	Detection of malicious URLs (spam, phishing, exploits, and so on)
The Public Terabyte Dataset Project (2010) [+]	Bixo Labs / Amazon	Web Mining - Content & Usage	A crawl of 50-200M pages from the top 100K English-language domains (by traffic).
The Internet Traffic Database (2008) [+]	Lawrence Berkeley National Laboratory / ACM SIGCOMM	Web Mining - Usage	HTTP requests on different servers
Web->KB Project [+]	CMU World Wide Knowledge Base	Web Mining - Content / MRDM	It contains: (1) a data set of classified Web pages; (2) a relational data set describing both pages and hyperlinks; (3) a subset of the 4 Universities dataset containing web pages and hyperlink data; (4) the 20 newsgroups dataset; (5) the 7sectors dataset.
Syskill and Webert Web Page Ratings [+]	UCI	User Ratings	Predicting user ratings of web pages
KDD Cup 2005 [+]	ACM SIGKDD	N/A	Query categorization (800,000 queries into 67 predefined categories)
KDD Cup 2007 [+]	ACM SIGKDD	User Ratings	Predicting aspects of movie rating behavior.
MovieLens Data Sets [+]	GroupLens Research	User Ratings	Three datasets are currently available: (1) 100,000 ratings for 1,682 movies by 943 users; (2) 1 million ratings for 3,900 movies by 6,040 users; (3) 10 million ratings and 100,000 tags for 10,681 movies by 71,567 users.
Anonymous Ratings from the Jester Online Joke Recommender System [+]	Jester 4.0	User Ratings	Dataset 1: Over 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users, collected between April 1999 and May 2003. Dataset 2: Over 1.7 million continuous ratings (-10.00 to +10.00) of 150 jokes from 63,974 users, collected between November 2006 and May 2009.
Book-Crossing Dataset [+]	IIF – Institut für Informatik – Freiburg	User Ratings	Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.
Last.fm - Music Recommendation Datasets for Research (2010) [+]	Óscar Celma, UOC	User Ratings	This dataset contains [user, artist, plays] tuples (for ~360,000 users) collected from the Last.fm API.
Reuters 21578 [+]	David Lewis	Text mining	The Reuters-21578 text categorization test collection
Web Spam Detection [+]	Yahoo! Research Barcelona	Spam Detection	WEBSPAM-UK2007, WEBSPAM-UK2006, and older collections
The Enron dataset [+]	CMU.edu	Real e-mail content	It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages.
CiteULike Available Datasets [+]	CiteULike	N/A	The file is an anonymous dump of who posted what and when each posting took place. It contains no data that is not already publicly available through the web site, so making it available has no privacy implications.
Wikipedia Database Complete Dump [+]	Wikipedia	Text mining	The latest complete dump of the English-language Wikipedia
The EUR-Lex datasets [+]	TU Darmstadt	Text mining	The EUR-Lex text collection is a collection of documents about European Union law. The most important categorization is provided by the EUROVOC descriptors, which form a topic hierarchy with almost 4000 categories regarding different aspects of European law.
The 4-Universities Dataset [+]	CMU World Wide Knowledge Base	Web Mining	This data set contains WWW pages collected from computer science departments of various universities in January 1997 by the Web->KB project of the CMU text learning group. The 8,282 pages were manually classified into 7 categories.
The 4-Universities Dataset (Relational version) [+]	CMU World Wide Knowledge Base	Web Mining - ILP	The data consists of relations suitable as input to FOIL, as well as the complete text of all the web pages, of the anchors, and of the text surrounding the anchors.

Data set repositories

Repository	Owner
Datasets for Data Mining, Analytics and Knowledge Discovery [+]	KDnuggets
UC Irvine Machine Learning Repository [+][+]	UCI KDD Archive
AWS Developer Community [+]	Amazon Web Services
Intl. Network for Social Network Analysis [+]	INSNA
Datasets for training [+]	UCLA Statistics datasets
Trust network datasets (social network datasets) [+]	TrustLet.org
Data for Research (by categories) [+]	Daniel Lemire
Information Retrieval Resources (Niraj) [+]	Niraj Kumar
IR Multilingual Resources at UniNE (Stemming Dictionaries) [+]	Université de Neuchâtel
Public Databases @ Bixo Labs [+]	Bixo Labs
Frequent Itemset Mining Dataset Repository [+]	FIMI workshops (2003/04)
The LUCS-KDD Discretised/Normalised (V2) ARM and CARM Data Library [+]	Frans Coenen
Regression datasets [+]	Luis Torgo (Univ. Porto)
PMML Sample Models [+]	Data Mining Group
EDM datasets [+]	PSLC DataShop
Data Mining and Exploration (for students) [+]	The University of Edinburgh
SWEO Community Project: Linking Open Data on the Semantic Web [+]	W3C
The Text REtrieval Conference (TREC) datasets [+]	US National Institute of Standards and Technology
GoogleLabs Public Data Explorer [+]	Google
The KEEL dataset repository [+]	KEEL Spanish Research Project

Resources and tools

Resource	Owner	Description
WorkGenesis [+]	J.R. Romero (Univ. Córdoba)	Meta-tool for the quick construction of data intensive workflow management systems.
(In house) test set generator v3.2 [+]	Frans Coenen (Univ. Liverpool)	The data generator is intended to produce data sets for use in the testing of Association Rule Mining (ARM) algorithms, but may very well have other uses. Written in Java.
Software from KARYPIS Lab [+]	University of Minnesota	Software on partitioning, clustering, information retrieval, etc.
Orange [+]	BioLab	Open source data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting. Components for machine learning. Extensions for bioinformatics and text mining. Packed with features for data analytics.
KNIME [+]	KNIME.com AG	KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform. From day one, KNIME has been developed using rigorous software engineering practices and is used by professionals in both industry and academia in over 60 countries.
RapidMiner [+]	Rapid-I	Available as a stand-alone application for data analysis and as a data mining engine for integration into your own products. Data integration, analytical ETL, data analysis, and reporting in a single suite. Powerful but intuitive graphical user interface for the design of analysis processes. Repositories for process, data, and metadata handling.
KEEL [+]	Several Spanish universities (Spanish National Projects TIC2002-04036-C05, TIN2005-08386-C05 and TIN2008-06681-C06)	KEEL is an open source (GPLv3) Java software tool to assess evolutionary algorithms for data mining problems, including regression, classification, clustering, and pattern mining. It contains a large collection of classical knowledge extraction algorithms, preprocessing techniques (training set selection, feature selection, discretization, imputation methods for missing values, etc.), computational intelligence based learning algorithms, including evolutionary rule learning algorithms based on different approaches (Pittsburgh, Michigan, IRL, ...), and hybrid models such as genetic fuzzy systems and evolutionary neural networks. It supports a complete analysis of any learning model in comparison to existing ones, and includes a statistical test module for the comparison.
Weka [+]	University of Waikato	Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
Frequent Pattern Mining Implementations (C++) [+]	Bart Goethals	Implementations of several ARM algorithms: Apriori, NDI, Eclat, FP-Growth, DIC, etc.
Frequent Itemset Mining Implementations Repository [+]	FIMI	Implementations of several algorithms for frequent itemset mining
