Over the years, the CSTNews corpus has been annotated by groups of computational linguists from NILC. The corpus is available below for download and use for research purposes. For the publications describing each annotation, see the "Publications" page.

DOWNLOAD THE ANNOTATED CORPUS (version 6.0, made available at the end of 2017)

Differences from the previous version (version 5.0)
- inclusion of (nominal) coreference annotation for all the source texts, produced under the specifications of the IberEval 2017 annotation task (see the corresponding paper here; the complete corpus produced during the IberEval 2017 task and the related information may be viewed at this link)

For each cluster inside the corpus, the following information is available:

  • a folder named "Textos-fonte", with the original source texts (in .txt format) and their titles (in _titulos.txt format) - each file name identifies the numbers of the document and the cluster, the source agency, as well as day, month, year and local time information for the news, whenever these data were available during corpus compilation
  • a folder named "Textos-fonte segmentados", with the original source texts with sentence boundaries delimited by new line characters
  • a folder named "Sumarios", with the following: the manual summary of each document in the cluster (in _sumario_humano.txt format for each document), together with some information provided by the human summarizer (in _datos.txt format): the gist of the text, the size of the text in number of words, the intended size of the summary (corresponding to 30% of the source text), the summary, and the actual size of this summary; the original manual multi-document summary for the cluster (in _sumario_humano.txt format for each cluster) and its corresponding manual extractive summary (in _extrato_humano.txt format); an automatic multi-document summary produced by the CSTSumm system (in _sumario_automatico_CSTSumm.txt format for each cluster) and a version of it with sentences manually ordered (in _sumario_automatico_CSTSumm_ordenado_manuamente.txt format for each cluster); and 5 new manually produced multi-document abstracts and 5 multi-document extracts in the "Novos sumários" folder (divided into the subfolders "Abstracts" and "Extratos")
  • a folder named "Expressoes temporais", with the temporal expressions manually identified and normalized (with XML tags) for each document according to the proposal of Baptista et al. (2008)
  • a folder named "RST", with the RST annotation of each document, produced with RSTTool (developed by Michael O'Donnell) - the documents that were used for computing annotation agreement have their corresponding cluster folder name followed by the string "-concordanciaRST" (and there is a folder named "concordancia" inside the RST folder with the evaluated files)
  • a folder named "CST", with the CST annotation of each cluster (for every possible pair of documents in each cluster), produced with CSTTool - the clusters that were used for computing annotation agreement have their corresponding folder name followed by the string "-concordanciaCST" (and there is a folder named "concordancia" inside the CST folder with the evaluated files)
  • a folder named "dls", with subfolders "noun" and "verb", containing the source texts with their (10% most frequent) nouns and (all) verbs accompanied by their corresponding Princeton WordNet synset identification numbers (in the .dls files), and general XML files for all the source texts in the cluster, showing the details of the word sense annotation (such as the possible translations of the Portuguese words into English, whether they were manually or automatically translated, the available synsets and the selected one); this annotation was completely manual; for each cluster, there is also an XML file with the corresponding verb and noun ontologies composed of the selected synsets in the Princeton WordNet
  • a folder named "CX_Tópicos", with one file for each source text, containing its manual subtopic segmentation (in the 't' XML-like tag), the keywords (in the "label" attribute) that represent the corresponding subtopic (right above the XML-like tag), and a unique identifier for each subtopic (in the "top" attribute), so that it is possible to look for other occurrences of the same subtopic in the other texts in the cluster (since they are referenced by the same unique identifier); each folder also comes with a "notasCX.txt" file, which stores the list of passages belonging to each subtopic, the number of sentences and words of each subtopic, and the presence of each subtopic in the sentences of the corresponding manual (abstractive) multi-document summary; finally, there is a "_agrupamento_manual.txt" file in each cluster, which summarizes the distribution of subtopics in the texts (in each line, the first column indicates the id of the subtopic, the second column indicates the id of the document, and the third column indicates the id of the sentence that belongs to the indicated subtopic)
  • a folder named "Analise_sintatica", with XML files for each source text and its title with the corresponding syntactic analyses, which were automatically produced by the PALAVRAS parser (Bick, 2000)
  • a folder named "Alignment", with a txt file with an XML-like annotation indicating the source text sentences that were aligned to each (manually created) multi-document summary sentence, as well as the relationship type of each alignment and the human judges that indicated it
  • a folder named "Aspectos", with a txt file with the multi-document manual summary with its sentences annotated according to the aspects they present; aspects, in this sense, are related to the information that the sentences convey, e.g., WHAT, WHERE and WHEN information about some event (based on the TAC proposal for the guided summarization task)
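
As an illustration of the three-column layout described for the "_agrupamento_manual.txt" files (subtopic id, document id, sentence id), the following minimal Python sketch groups sentences by subtopic. It assumes that the columns are separated by whitespace and treats every id as a plain string; neither detail is stated above, so both may need adjusting to the actual files. The function name and the sample lines are hypothetical and are not taken from the corpus.

```python
from collections import defaultdict


def parse_agrupamento(lines):
    """Group (document id, sentence id) pairs by subtopic id.

    Assumption: each line holds three whitespace-separated columns,
    in the order subtopic id, document id, sentence id.
    """
    by_subtopic = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        subtopic_id, document_id, sentence_id = fields[:3]
        by_subtopic[subtopic_id].append((document_id, sentence_id))
    return dict(by_subtopic)


# Hypothetical sample lines, made up for illustration only.
sample = ["1 D1 1", "1 D2 3", "2 D1 2"]
groups = parse_agrupamento(sample)
```

With a real file, the lines could be read with open(path, encoding=...) and passed to the same function; the file encoding is not stated on this page and would have to be checked.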

In the folder "For all the clusters" (in the root), the (nominal) coreference annotation (according to the IberEval 2017 annotation task) and the ontologies produced during the DLS annotation are available. The complete corpus produced during the IberEval 2017 task and the related information may be viewed at this link.
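
After downloading, a quick check of the folder layout can be sketched as follows. This is a hedged example and not part of the corpus distribution: it assumes that each cluster is a directory directly under the corpus root and that the annotation folders listed above sit directly inside each cluster directory. The function name missing_folders is hypothetical.

```python
import unicodedata
from pathlib import Path

# Folder names as listed on this page.
EXPECTED_FOLDERS = [
    "Textos-fonte", "Textos-fonte segmentados", "Sumarios",
    "Expressoes temporais", "RST", "CST", "dls", "CX_Tópicos",
    "Analise_sintatica", "Alignment", "Aspectos",
]
SHARED_FOLDER = "For all the clusters"


def missing_folders(root):
    """Map each cluster directory name to the expected folders it lacks.

    Assumption: clusters are the directories directly under `root`,
    except the shared "For all the clusters" folder.
    """
    expected = [unicodedata.normalize("NFC", n) for n in EXPECTED_FOLDERS]
    report = {}
    for cluster in sorted(Path(root).iterdir()):
        if not cluster.is_dir() or cluster.name == SHARED_FOLDER:
            continue
        # Normalize names so that accented letters compare reliably.
        present = {
            unicodedata.normalize("NFC", p.name)
            for p in cluster.iterdir()
            if p.is_dir()
        }
        report[cluster.name] = [n for n in expected if n not in present]
    return report
```

Calling missing_folders on the unpacked corpus root returns an empty list for every cluster whose layout matches the description; any other result points either to an incomplete download or to a layout that differs from the assumptions above.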

NILC - Interinstitutional Center for Computational Linguistics