Over the years, the CSTNews corpus has been
annotated by groups of computational linguists from NILC. The corpus is available below for download and use for
research purposes. For the publications describing each annotation, see the "Publications" page.
Version 6.0 (made available at the end of 2017)
Differences from the previous version:
- inclusion of (nominal) coreference annotation for all the
source texts, produced under the specifications of the
IberEval 2017 annotation task (the complete corpus produced
during the IberEval 2017 task and the related information may
be viewed in this link)
For each cluster inside the corpus, the following information is available:
- a folder named "Textos-fonte", with the
original source texts (in .txt format) and their titles (in _titulos.txt
format) - each file name identifies the numbers of the document
and the cluster, the source agency, as well as day, month, year
and local time information for the news, whenever these data
were available during corpus compilation
- a folder named "Textos-fonte segmentados",
with the original source texts with sentence boundaries
delimited by new line characters
- a folder named "Sumarios", with the
following: the manual summary of each document in the cluster
(in _sumario_humano.txt format for each document), with some
information provided by the human summarizer (in _datos.txt format):
the gist of the text, the size of the text in number of words,
the intended size of the summary (corresponding to 30% of the
source text), the summary, and the actual size of this summary;
the original manual multi-document summary for the cluster
(in _sumario_humano.txt format for each cluster) and its
corresponding manual extractive summary (in _extrato_humano.txt
format); an automatic multi-document summary produced by the CSTSumm
system (in _sumario_automatico_CSTSumm.txt format for each
cluster) and a version of it with sentences manually ordered (in
_sumario_automatico_CSTSumm_ordenado_manuamente.txt format for
each cluster); and 5 new (manually produced) multi-document
abstracts and 5 multi-document extracts in the "Novos sumários"
folder (divided into the subfolders "Abstracts" and "Extratos")
- a folder named "Expressoes temporais", with
the temporal expressions manually identified and normalized
(with XML tags) for each document according to
the proposal of Baptista et al. (2008)
- a folder named "RST", with the RST annotation
of each document using
RSTTool, produced by Michael O'Donnell - the documents that
were used for computing annotation agreement have their
corresponding cluster folder name followed by the "-concordanciaRST"
string (and there is a folder named "concordancia" inside the
RST folder with the evaluated files)
- a folder named "CST", with the CST annotation
of each cluster (for every possible pair of documents in each
cluster) using CSTTool - the clusters that were used for
computing annotation agreement have their corresponding folder
name followed by the "-concordanciaCST" string (and there is a
folder named "concordancia" inside the CST folder with the
evaluated files)
- a folder named "dls", with
subfolders "noun" and "verb", with the source texts
with their (10% most frequent) nouns and (all) verbs accompanied by their
corresponding Princeton WordNet synset identification numbers (in the .dls
files) and general XML files for all the source texts in the
cluster, showing the details of the word sense annotation (such as
the possible translations of the Portuguese words to English,
whether they were manually or automatically translated, the
available synsets and the selected one); this annotation was
completely manual; for each cluster, there is also an XML file
with the corresponding verb and noun ontologies composed of the selected
synsets in the Princeton WordNet
- a folder named "CX_Tópicos", with one file
for each source text, containing its manual subtopic
segmentation (in the 't' XML-like tag), the keywords
(in the "label" attribute) that represent the corresponding
subtopic (right above the XML-like tag), and a unique
identifier for each subtopic (in the "top" attribute) so that it
is possible to look for other occurrences of the same subtopic
in the other texts in the cluster (since they are also
referenced by the same unique identifier); each folder also
comes with a "notasCX.txt" file, which stores information
regarding the list of passages belonging to each subtopic, the
number of sentences and words of each subtopic, and the presence
of each subtopic in the corresponding manual (abstractive)
multi-document summary sentences; finally, there is a
"_agrupamento_manual.txt" file in each cluster, which summarizes
the distribution of subtopics in the texts (in each line, the
first column indicates the id of the subtopic, the second column
indicates the id of the document, and the third column indicates
the id of the sentence that belongs to the indicated subtopic)
- a folder named "Analise_sintatica", with XML
files for each source text and its title with the corresponding
syntactic analyses, which were automatically produced by the
PALAVRAS parser (Bick, 2000)
- a folder named "Alignment", with a txt file
with an XML-like annotation indicating the source text sentences
that were aligned to each (manually created) multi-document summary sentence, as well as the
relationship type of each alignment and the human judges that
indicated it
- a folder named "Aspectos", with a txt file
with the multi-document manual summary with its sentences
annotated according to the aspects they present; aspects, in
this sense, are related to the information that the sentences
convey, e.g., WHAT, WHERE and WHEN information about some event
(based on the
TAC proposal for the guided summarization task)
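
As an illustration of the three-column layout described for the "_agrupamento_manual.txt" files (subtopic id, document id, sentence id), the following is a minimal Python sketch of how such lines might be grouped by subtopic. It assumes that the columns are separated by whitespace, which is not stated above; the function name and the sample rows are hypothetical and do not come from the corpus.

```python
from collections import defaultdict


def parse_subtopic_grouping(lines):
    """Map each subtopic id to the (document id, sentence id) pairs listed for it.

    Assumption: columns are whitespace-separated, in the order
    subtopic id, document id, sentence id.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        subtopic_id, document_id, sentence_id = fields[:3]
        groups[subtopic_id].append((document_id, sentence_id))
    return dict(groups)


# Made-up rows for illustration only (not taken from the corpus).
sample = ["1 D1 1", "1 D2 3", "2 D1 2"]
grouping = parse_subtopic_grouping(sample)
```

For a real file, the lines could be read with open(path, encoding="utf-8"); the actual encoding and separator should be checked against the downloaded files.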
In the folder "For all the clusters" (in the root), the
(nominal) coreference annotation (according to the
IberEval 2017 annotation task) and the ontologies produced
during the DLS annotation are available. The complete corpus
produced during the IberEval 2017 task and the related
information may be viewed in
this link.
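
As a side note on the "CST" folder described in the list above, which covers every possible pair of documents in each cluster, the pairs can be enumerated as in the following Python sketch. It assumes that the pairs are unordered (each pair of documents is considered once); the function name and the document identifiers are hypothetical.

```python
from itertools import combinations


def document_pairs(document_ids):
    """Return every unordered pair of documents in a cluster.

    Assumption: a pair (A, B) is the same as (B, A), so each pair
    is listed only once.
    """
    return list(combinations(sorted(document_ids), 2))


# Made-up document identifiers for illustration only.
pairs = document_pairs(["D1", "D2", "D3"])
```

Under this assumption, a cluster with n documents yields n * (n - 1) / 2 pairs.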