Dr. José Raúl Romero Salguero
www.jrromero.net
DATASETS
- The SEER data [+]
- Owner: U.S. National Cancer Institute
- Context: N/A
- Description: Diagnosis
- ICML 2009 dataset (2009) [+]
- Owner: UCSD Dept. of Computer Science and Engineering
- Context: Web mining - Usage
- Description: Detection of malicious URLs (spam, phishing, exploits, and so on)
- The Public Terabyte Dataset Project (2010) [+]
- Owner: Bixo Labs / Amazon
- Context: Web Mining - Content & Usage
- Description: The data comes from a crawl of 50-200M pages from the top 100K English-language domains (by traffic).
- The Internet Traffic Database (2008) [+]
- Owner: Lawrence Berkeley National Laboratory / ACM SIGCOMM
- Context: Web mining - Usage
- Description: HTTP requests on different servers
- Web->KB Project [+]
- Owner: CMU World Wide Knowledge Base
- Context: Web Mining - Content / MRDM
- Description: It contains: (1) a data set consisting of classified Web pages; (2) a relational data set describing both pages and hyperlinks; (3) a subset of the 4 Universities dataset containing web pages and hyperlink data; (4) the 20 newsgroups dataset; (5) the 7sectors dataset.
- Syskill and Webert Web Page Ratings [+]
- Owner: UCI
- Context: User Ratings
- Description: To predict user ratings on web pages
- KDD Cup 2005 [+]
- Owner: ACM SIGKDD
- Context: N/A
- Description: Query categorization (800,000 queries into 67 predefined categories)
- KDD Cup 2007 [+]
- Owner: ACM SIGKDD
- Context: User Ratings
- Description: Predicting aspects of movie rating behavior.
- MovieLens Data Sets [+]
- Owner: GroupLens Research
- Context: User Ratings
- Description: They currently have three datasets available: (1) 100,000 ratings for 1,682 movies by 943 users; (2) 1 million ratings for 3,900 movies by 6,040 users; (3) 10 million ratings and 100,000 tags for 10,681 movies by 71,567 users.
- Anonymous Ratings from the Jester Online Joke Recommender System [+]
- Owner: Jester 4.0
- Context: User Ratings
- Description: Dataset 1: Over 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users, collected between April 1999 and May 2003. Dataset 2: Over 1.7 million continuous ratings (-10.00 to +10.00) of 150 jokes from 63,974 users, collected between November 2006 and May 2009.
- Book-Crossing Dataset [+]
- Owner: IIF – Institut für Informatik – Freiburg
- Context: User Ratings
- Description: Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.
- Last.fm - Music Recommendation Datasets for Research (2010) [+]
- Owner: Óscar Celma, UOC
- Context: User Ratings
- Description: This dataset contains [user, artist, plays] tuples (for ~360,000 users) collected from the Last.fm API.
- Reuters 21578 [+]
- Owner: David Lewis
- Context: Text mining
- Description: Reuters 21578
- Web Spam Detection [+]
- Owner: Yahoo! Research Barcelona
- Context: Spam Detection
- Description: WEBSPAM-UK2007, WEBSPAM-UK2006, and older datasets
- The Enron dataset [+]
- Owner: CMU.edu
- Context: Real e-mail content
- Description: It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages.
- CiteULike Available Datasets [+]
- Owner: CiteULike
- Context: N/A
- Description: The file constitutes an anonymous dump of who posted what and when the posting took place. There is no data in this file which is not already available publicly through the web site, so there are no privacy implications for making it available.
- Wikipedia Database Complete Dump [+]
- Owner: Wikipedia
- Context: Text mining
- Description: The latest complete dump of the English-language Wikipedia
- The EUR-Lex datasets [+]
- Owner: TU Darmstadt
- Context: Text mining
- Description: The EUR-Lex text collection is a collection of documents about European Union law. The most important categorization is provided by the EUROVOC descriptors, which form a topic hierarchy with almost 4000 categories regarding different aspects of European law.
- The 4-Universities Dataset [+]
- Owner: CMU World Wide Knowledge Base
- Context: Web Mining
- Description: This data set contains WWW pages collected from computer science departments of various universities in January 1997 by the Web->KB project of the CMU text learning group. The 8,282 pages were manually classified into 7 categories.
- The 4-Universities Dataset (Relational version) [+]
- Owner: CMU World Wide Knowledge Base
- Context: Web Mining - ILP
- Description: The data consists of relations suitable for providing to FOIL, as well as the complete text of all the web pages and also of anchors and the text surrounding anchors.
REPOSITORIES
- Datasets for Data Mining, Analytics and Knowledge Discovery [+]
- UC Irvine Machine Learning Repository [+][+]
- AWS Developer Community [+]
- Owner: Amazon Web Services
- Intl. Network for Social Network Analysis [+]
- Datasets for training [+]
- Owner: UCLA Statistics datasets
- Trust network datasets (social network datasets) [+]
- Data for Research (by categories) [+]
- Information Retrieval Resources (Niraj) [+]
- IR Multilingual Resources at UniNE (Stemming Dictionaries) [+]
- Owner: Université de Neuchâtel
- Public Databases @ Bixo Labs [+]
- Frequent Itemset Mining Dataset Repository [+]
- Owner: FIMI workshops (2003/04)
- The LUCS-KDD Discretised/Normalised (V2) ARM and CARM Data Library [+]
- Regression datasets [+]
- Owner: Luis Torgo (Univ. Porto)
- PMML Sample Models [+]
- EDM datasets [+]
- Data Mining and Exploration (for students) [+]
- Owner: The University of Edinburgh
- SWEO Community Project: Linking Open Data on the Semantic Web [+]
- The Text REtrieval Conference (TREC) datasets [+]
- Owner: US National Institute of Standards and Technology
- GoogleLabs Public Data Explorer [+]
- The KEEL dataset repository [+]
- Owner: KEEL Spanish Research Project
BLOGS
- MobBlog [+]
- Description: Research: Trust, Reputation, Recommendations and Mobility
- FlowingData [+]
- Description: 6 Influential Datasets that Changed the Way We Think
- The DataWrangling Blog [+]
- Description: More than 400 dataset bookmarks
RESOURCES & SOFTWARE
- (In house) test set generator v3.2 [+]
- Owner: Frans Coenen (Univ. Liverpool)
- Description: The data generator is intended to produce data sets for use in the testing of Association Rule Mining (ARM) algorithms, but may very well have other uses. Written in Java.
- Software from KARYPIS Lab [+]
- Owner: University of Minnesota
- Description: Software on partitioning, clustering, information retrieval, etc.
- Orange [+]
- Owner: BioLab
- Description: Open source data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting. Components for machine learning. Extensions for bioinformatics and text mining. Packed with features for data analytics.
- Knime [+]
- Owner: Knime.com AG
- Description: KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform. From day one, KNIME has been developed using rigorous software engineering practices and is used by professionals in both industry and academia in over 60 countries.
- RapidMiner [+]
- Owner: Rapid-I
- Description: It is available as a stand-alone application for data analysis and as a data mining engine for integration into your own products. Data Integration, Analytical ETL, Data Analysis, and Reporting in one single suite. Powerful but intuitive graphical user interface for the design of analysis processes. Repositories for process, data and meta data handling.
- KEEL [+]
- Owner: Several Spanish universities (Spanish National Projects TIC2002-04036-C05, TIN2005-08386-C05 and TIN2008-06681-C06)
- Description: KEEL is an open source (GPLv3) Java software tool to assess evolutionary algorithms for Data Mining problems including regression, classification, clustering, pattern mining and so on. It contains a big collection of classical knowledge extraction algorithms, preprocessing techniques (training set selection, feature selection, discretization, imputation methods for missing values, etc.), Computational Intelligence based learning algorithms, including evolutionary rule learning algorithms based on different approaches (Pittsburgh, Michigan and IRL, ...), and hybrid models such as genetic fuzzy systems, evolutionary neural networks, etc. It allows us to perform a complete analysis of any learning model in comparison to existing ones, including a statistical test module for comparison.
- Weka [+]
- Owner: University of Waikato
- Description: Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
- Frequent Pattern Mining Implementations (C++) [+]
- Owner: Bart Goethals
- Description: Implementation of several ARM algorithms: Apriori, NDI, Eclat, FP-Growth, DIC, etc.
- Frequent Itemset Mining Implementations Repository [+]
- Owner: FIMI
- Description: Implementation of several algorithms for Frequent Itemset Mining