WebStar - Data Sets

URL: https://webstar.deri.ie/datasets/ Please use the DERI username to log in.

| Data Set ID | Description | Format | Raw data access | SPARQL |
| --- | --- | --- | --- | --- |
| LOD | A collection of data sets from the LOD cloud. | The directory contains different file formats and subfolders | HTTPS | NO |
| BTC09 | The official data set of the Billion Triples Challenge. For further detail, please see the official data set homepage. There are also a couple of subdirectories containing tgz files with all the sameAs quads and all the wikilink quads, respectively, already extracted from the dump. | Multiple files in NQ format | HTTPS or HTTP | NO |
| ICWSM07 | "The International Conference on Weblogs and Social Media (26-28 March 2006, Boulder CO, USA) is offering a large blog dataset to conference participants. The data release comprises a complete set of weblog posts collected by Nielsen BuzzMetrics for May 2006. It consists of about 14M weblog posts in XML format from 3M weblogs and is annotated with 1.7M blog-blog links. The marked-up fields include: date of posting, time of posting, author name, title of the post, weblog url, permalink, tags/categories, and outlinks classified by type. The compressed dataset is over 10GB. In addition to the data, the conference organizers hope to release processing code and a shared repository for those making use of the dataset. ..." | Blog post SQL dumps, 30GB uncompressed, 6.7GB compressed (bz2) | HTTPS | NO |
| ICWSM09 | "The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. The post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers approximating to some degree search engine ranking. The total size of the dataset is 142 GB uncompressed, (27 GB compressed). This dataset spans a number of big news events (the Olympics; both US presidential nominating conventions; the beginnings of the financial crisis; ...) as well as everything else you might expect to find posted to blogs." (ICWSM09) | Compressed XML (tar.gz) | HTTPS | NO |
| SINDICE-DUMP | A Sindice dump from March 2009. Most of the Sindice dump can also be found in the BTC09 data set. | NQ format, 100GB uncompressed, 7.5GB gz | HTTPS | NO |
| boards.ie | Ten years of discussions from the Irish forum site boards.ie, from 1998 to 2008. Transformed from the SIOC format. All forums, threads, users, posts; a general description of the data is available. There is also a graphical representation of the schema available and some simple example queries. Raw data and SPARQL access are only available after agreeing to a license; please contact Marcel Karnstedt from the UIMR unit. | RDF/XML, N-Triples | -- | YES |
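
Several of the data sets above (BTC09, SINDICE-DUMP) are listed as compressed files in NQ (N-Quads) format, far too large to load into memory at once. As a rough illustration only, the following Python sketch streams a gzip-compressed N-Quads file line by line and filters the owl:sameAs quads. The function names, the `Quad` type, and the regular expression are assumptions made for this example and are not part of the WebStar setup; the regex covers only common N-Quads line shapes and is not a complete parser.

```python
import gzip
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# One RDF term: an IRI, a blank node, or a quoted literal with an
# optional datatype or language tag. Simplified; not a full N-Quads grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'
QUAD_RE = re.compile(
    r'^\s*' + TERM + r'\s+' + TERM + r'\s+' + TERM
    + r'(?:\s+' + TERM + r')?\s*\.\s*$'
)

SAME_AS = '<http://www.w3.org/2002/07/owl#sameAs>'


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: Optional[str]  # None when the line is a plain triple


def iter_quads(lines: Iterable[str]) -> Iterator[Quad]:
    """Yield one Quad per well-formed line; skip blanks, comments, and junk."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue
        match = QUAD_RE.match(line)
        if match is None:
            continue  # crawled dumps often contain malformed lines
        yield Quad(*match.groups())


def same_as_quads(lines: Iterable[str]) -> Iterator[Quad]:
    """Keep only the quads whose predicate is owl:sameAs."""
    return (q for q in iter_quads(lines) if q.predicate == SAME_AS)


def open_nq_gz(path: str):
    """Open a gzip-compressed N-Quads file as a text stream (hypothetical path)."""
    return gzip.open(path, 'rt', encoding='utf-8', errors='replace')
```

With a downloaded file, usage would be along the lines of `with open_nq_gz("some-dump.nq.gz") as f: for quad in same_as_quads(f): ...`, where the file name is a placeholder. Because both functions are generators, only one line is held in memory at a time.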