WebStar - Data Sets

Version 7 by jueumb
on May 27, 2009 11:35.

compared with
Version 8 by jueumb
on May 27, 2009 11:36.

|| Data Set ID || Description || Format || Raw data access || SPARQL ||
| LOD | A collection of data sets from the LOD cloud. [LOD cloud|http://www.foaf-project.org/]. | The directory contains different file formats and subfolders | [HTTPS|https://webstar.deri.ie/datasets/LOD/] | NO |
| BTC09 | The official data set of the [billion triples challenge|http://challenge.semanticweb.org/]. For further details, please see the [official data set homepage|http://vmlion25.deri.ie/index.html]. | Multiple files in [NQ format|http://sw.deri.org/2008/07/n-quads/] | [HTTPS|https://webstar.deri.ie/datasets/btc2009/] or [HTTP|http://vmlion25.deri.ie/index.html] | NO |
| ICWSM07 | "The [International Conference on Weblogs and Social Media|http://www.icwsm.org/] (26-28 March 2006, Boulder CO, USA) is offering a large blog dataset to conference participants. The data release comprises a complete set of weblog posts collected by [Nielsen BuzzMetrics|http://www.nielsenbuzzmetrics.com/] for May 2006. It consists of about 14M weblog posts in XML format from 3M weblogs and is annotated with 1.7M blog-blog links. The marked-up fields include: date of posting, time of posting, author name, title of the post, weblog url, permalink, tags/categories, and outlinks classified by type. The compressed dataset is over 10GB. In addition to the data, the conference organizers hope to release processing code and a shared repository for those making use of the dataset. ..." [Blog post|http://ebiquity.umbc.edu/blogger/2006/09/08/icwsm-2007-weblog-dataset-released/] | SQL dumps \-30GB uncompressed, 6.7GB compressed (bz2) | [HTTPS|https://webstar.deri.ie/datasets/icwsm2007_preprocessed/] | NO |
| ICWSM09 | "The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. The post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers approximating to some degree search engine ranking. The total size of the dataset is 142 GB uncompressed, (27 GB compressed). \\
This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial crisis; ...) as well as everything else you might expect to find posted to blogs." [ICWSM09|http://www.icwsm.org/2009/data/] | compressed XML (tar.gz) | [HTTPS|https://webstar.deri.ie/datasets/icwsm2009/] | NO |
| SINDICE-DUMP | This data set is a Sindice dump from March 2009. Most of the Sindice dump can also be found in the BTC09 data set. | [NQ format|http://sw.deri.org/2008/07/n-quads/] \- 100GB uncompressed, 7.5GB gz | [HTTPS|https://webstar.deri.ie/datasets/sindice/] | NO |
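
Since the BTC09 and SINDICE-DUMP data sets are listed as gzip-compressed NQ files far too large to load into memory, a streaming pass is probably the practical way to inspect them. The following is a minimal sketch only, not a full N-Quads parser: it assumes every statement sits on one line, ends with " ." and carries a context as its fourth element with no embedded whitespace. The function names and the inline sample statements are hypothetical and are not part of the data sets.

```python
import gzip
import io
from collections import Counter


def count_contexts(stream):
    """Count statements per context (the fourth element) in an N-Quads text stream."""
    counts = Counter()
    for line in stream:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Skip lines that do not end with the statement terminator.
        if not line.endswith("."):
            continue
        body = line[:-1].rstrip()
        # Assumption: the context is the last whitespace-separated token,
        # so literals containing spaces in the object position do not matter.
        parts = body.rsplit(None, 1)
        if len(parts) != 2:
            continue
        counts[parts[1]] += 1
    return counts


def count_contexts_in_gzip(path):
    """Stream a gzip-compressed N-Quads file from disk without unpacking it."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_contexts(handle)


# Hypothetical sample statements, used only to exercise the function.
sample = io.StringIO(
    '<http://a/s> <http://a/p> "hello world" <http://ctx/1> .\n'
    "<http://a/s> <http://a/p> <http://a/o> <http://ctx/1> .\n"
    "# a comment line\n"
    "\n"
    '<http://b/s> <http://b/p> "x" <http://ctx/2> .\n'
)
counts = count_contexts(sample)
```

For a downloaded file, `count_contexts_in_gzip` would be called with the local path of one of the `.gz` files. Lines that lack a context would be miscounted under the stated assumption, so a real N-Quads parser is the safer choice for anything beyond a quick look.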