Some ongoing projects
Text summarization and opinion mining
OPINANDO: Opinion Mining for Portuguese: Concept-based Approaches and Beyond - investigation of issues of concept-level analysis for the Brazilian Portuguese language
Aspect ontologies - groups of (hierarchically organized) opinion aspects for supporting opinion mining tasks, including the domains of smartphones, digital cameras and books, in OWL format
OpCluster-PT - an automatic opinion aspect clustering tool (in python) based on linguistic knowledge, as described in the MSc Dissertation of Vargas (2017)
NILC-WISE - Web Interface for Summary Evaluation - an easy-to-use online interface for running ROUGE (Lin, 2004) to evaluate summaries
Models for summary coherence evaluation - a set of implemented models for summary coherence evaluation, following several approaches, from traditional entity grids to discourse grids. See the PhD thesis of Marcio de Souza Dias for more information.
Summarization extension for Google Chrome - extension for on-line news summarization, based on the RSumm system
RC-4 multi-document summarizer - based on the best RST & CST-based summarization strategy proposed by Cardoso (2014)
RCT-4 multi-document summarizer - based on the best RST & CST & subtopics-based summarization strategy proposed by Cardoso (2014). It differs from the RC-4 summarizer above by adding subtopic segmentation and treatment.
Text-summary alignment - tool that includes a set of methods for aligning texts and their multi-document summaries, as developed by Agostini et al. (2014)
TextTiling for Portuguese - topical segmentation tool adapted to news texts in Brazilian Portuguese, based on the work of Hearst (1997)
ViSum - a visualization system for multi-document summarization (described by Lima, 2013)
CSTSumm - a multi-document summarizer based on CST information (see README.txt in the rar file)
RSumm - a multi-document summarizer based on the relationship maps proposed by Salton et al. (1997)
Sentence ordering program - program for ordering sentences in a multi-document summary (given the source-texts)
Corpus of automatic multi-document summaries with linguistic errors - a corpus of automatic multi-document summaries (for the texts of the CSTNews corpus) produced by 4 different summarizers of varying performance, manually annotated with linguistic errors. See the readme file for more details.
OpiSums-PT - a corpus of (extractive and abstractive) opinion summaries (170, in total) for reviews of books (13 reviews) and electronic products (4 reviews), written in Brazilian Portuguese
CSTNews - a corpus with 50 clusters of news texts - in Portuguese - with their multi-document summaries, as well as several discourse and semantic annotations
TeMário 2006 - 150 news texts and the corresponding human summaries, which complement the original TeMário corpus, resulting in a corpus of 250 texts for summarization purposes
GEI - Ideal Extracts Generator for Brazilian Portuguese - given the source text and its corresponding manual (human) summary, GEI generates the ideal extract (which is the juxtaposition of sentences from the source text that best correlate with the sentences of the manual summary) using Salton's cosine measure
DMSumm - Discourse Modeling SUMMarizer
NeuralSumm - NEURAL network for SUMMarization (for scientific texts) - with tools for training the system with new data, if necessary
GistSumm - GIST SUMMarizer
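The GEI entry above describes building an ideal extract by matching each sentence of a manual summary to the source sentences that best correlate with it under the cosine measure. The idea can be sketched roughly as follows. This is not GEI's code: the tokenization, the raw term-frequency weighting, the choice of a single best source sentence per summary sentence, and the function names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase word tokens; GEI's real preprocessing may differ.
    return re.findall(r"\w+", sentence.lower())


def cosine(a, b):
    # Cosine between two term-frequency vectors stored as Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ideal_extract(source_sentences, summary_sentences):
    # Hypothetical helper: for each summary sentence, keep the most similar
    # source sentence, then return the kept sentences in source order.
    if not source_sentences:
        return []
    source_vectors = [Counter(tokenize(s)) for s in source_sentences]
    chosen = set()
    for summary_sentence in summary_sentences:
        target = Counter(tokenize(summary_sentence))
        scores = [cosine(target, v) for v in source_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] > 0:  # skip summary sentences with no lexical overlap
            chosen.add(best)
    return [source_sentences[i] for i in sorted(chosen)]
```

Returning the selected sentences in their original order reflects the description of the ideal extract as a juxtaposition of source sentences; a different weighting scheme (for example TF-IDF) could be swapped into `cosine` without changing the rest.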
Text and discourse analysis
CSTNews interface - access to 50 clusters of news texts and their multidocument summaries, with texts annotated according to the Cross-document Structure Theory
CSTTool - a semi-automatic editing tool for annotating texts according to the Cross-document Structure Theory
CSTParser - a state-of-the-art CST discourse parser for Portuguese, using both symbolic and machine learning techniques (see Maziero, 2012)
--> Its stand-alone (offline) version (with some adaptations in relation to the online version) is also freely available for use
LIWC - Linguistic Inquiry and Word Count is a text analysis program that calculates the degree to which people use different categories of words across a wide array of texts. The available resource is a version of its dictionary for Brazilian Portuguese. See the original project here and the Brazilian version here. The corresponding publication for Portuguese may be found here.
Newshead - an on-line tool for searching and clustering related news
DiZer - DIscourse analyZER for Brazilian Portuguese (mainly for the Computer Science domain)
DiZer 2.0 - an on-line version of DiZer, which is easily adaptable and portable to different text types/genres and languages
RSTeval - tool for discourse parsing evaluation, following Marcu's (2000) evaluation method - the tool compares RST trees (automatically or manually produced) and reports precision and recall figures
Syntax-based text segmentation tool aiming at producing elementary discourse units for discourse parsing - it uses the parser PALAVRAS (Bick, 2000) for analyzing the input text and, then, applies syntactical segmentation rules
CorpusTCC - corpus of 100 Brazilian Portuguese scientific texts (from the Computer Science domain - introduction sections of theses), annotated with Marcu's RSTTool (using this relation set), used for developing DiZer
RST Toolkit - utility programs for processing RST files, offering several computational facilities for both computational and linguistic purposes
RhetDB - Rhetorical Database - an editing environment for handling the rhetorical analyses produced by Daniel Marcu's RSTTool; it offers several computational facilities for both computational and linguistic purposes
(this is an old version of the software; for better and more advanced features, use RST Toolkit above)
RHETALHO corpus annotated with Daniel Marcu's RSTTool, its annotation protocol and the relation set; this corpus consists of forty texts - 20 from the Computer Science domain and 20 from the on-line newspaper Folha de São Paulo (7 from the Cotidiano section, 7 from the Mundo section and 6 from the Science section) - annotated by 2 human experts in RST
sucinto - summarization for clever information access - investigation and exploration of multi-document summarization strategies for providing a more feasible and intelligent access to on-line information from news agencies
Tools and resources available at PorSimples webpage
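The RSTeval entry above mentions comparing RST trees and producing precision and recall numbers. A minimal sketch of that kind of comparison follows, assuming each tree has already been flattened into a set of (start, end, label) tuples. That representation and the function names are illustrative assumptions, not RSTeval's actual input format or implementation.

```python
def precision_recall(predicted, reference):
    # Both arguments are collections of hashable constituents,
    # e.g. (start_unit, end_unit, relation_label) tuples.
    predicted, reference = set(predicted), set(reference)
    matched = len(predicted & reference)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall


def spans_only(constituents):
    # Drop the label to score span boundaries alone.
    return {(start, end) for start, end, _label in constituents}


reference_tree = {(1, 2, "elaboration"), (1, 3, "cause"), (4, 5, "contrast")}
predicted_tree = {(1, 2, "elaboration"), (2, 3, "cause")}

labeled = precision_recall(predicted_tree, reference_tree)
unlabeled = precision_recall(spans_only(predicted_tree), spans_only(reference_tree))
```

Scoring labeled and unlabeled constituents separately makes it possible to tell segmentation and attachment errors apart from relation-labeling errors.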
Text mining and information extraction
Tools and resources available at Sickle Cell Anemia Project webpage
VisualTCA - an on-line tool for sentence alignment visualization
Trapezio - Translation Post-Editor
Neologism detection tool - a tool for detecting possible neologisms in Portuguese
There is also an old version of the program: a filtering program that looks for words in a text that are not contained in dictionaries. Some pre-processed dictionaries you can try: dictionary for Brazilian Portuguese, REPENTINO and Unitex-PB.
Redutor - software tool for reduction between DCG and LFG
Redutor 2 - software tool for reduction between DCG, LFG and GPSG
Lemmatizer for Portuguese - based on the MXPOST part-of-speech tagger and UNITEX dictionaries for Portuguese, this tool produces the lemmas of the words of a text stored in a plain text file. The source code is also provided. For more details, see the readme.pdf file or contact Erick G. Maziero (the developer of the system).
TeP 2.0 - on-line version of a thesaurus for Brazilian Portuguese
NCLEANER trained model for Portuguese - a trained model to be used with NCleaner (Evert, 2008) for cleaning web pages in Portuguese. The model was trained with 184 texts from several online sources, including Terra, UOL, BBC, Exame, Estadão, IG, R7, Zero Hora, G1, JB Online, and O Globo, among others.
SENTER for Portuguese and for English
how to use it:
On the command line, run the following: senter.exe myfile.txt
The segmented text will be stored in a file with the same name + ".seg" (for instance, myfile.txt.seg) with one sentence per line. The input file must be a plain text file.
Naive-Bayes classifier for Windows (Delphi source code included)
Pre-processing program: replaces numbers, websites and e-mail addresses in texts with generic concepts
NASP (see NASP++ below) - a tool for aiding in word sense annotation of nouns in Portuguese, using Princeton Wordnet as sense repository
NASP++ - an improved version of NASP (see above), with more facilities (e.g., the underlying generation of ontologies for the annotated words) and adapted to other part-of-speech tags
MulSEN - a multilingual version of NASP (see above)
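As described in the SENTER entry above, running senter.exe on myfile.txt writes the segmented text to myfile.txt.seg, one sentence per line. A small sketch for loading that output into a list of sentences is shown below. The function name is hypothetical, and the default encoding is an assumption; the actual encoding of the .seg file depends on the input and may need to be changed (for example to "latin-1").

```python
from pathlib import Path


def read_segmented(input_path, encoding="utf-8"):
    # SENTER stores its output next to the input file, with ".seg" appended.
    seg_path = Path(str(input_path) + ".seg")
    text = seg_path.read_text(encoding=encoding)
    # One sentence per line; blank lines are skipped.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, after running senter.exe myfile.txt, calling read_segmented("myfile.txt") returns the sentences found in myfile.txt.seg.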