Can you give me a shorter text, please?
Surely you’ve had to summarize a story for someone else, but have you ever stopped to think about the process behind such a common task? You need to be clear about whom you are summarizing the story for, because that determines which important points are selected and in which order the (succinct!) information should be said or written so that the listener or reader understands what you mean. It is far less likely that you have wondered whether this activity, which we do so often, can be performed along the same lines by computer systems, known as automatic summarizers.
Natural Language Processing (NLP) is an area that seeks to make interaction between humans and machines possible through the processing of natural languages. The first studies in this direction resulted in automatic translators, closely associated with the early Cold War (1947–1953). At that time, the aim of these translators was to decode messages intercepted between opposing armies, and this “translation” was performed only by means of computational codes and/or mathematical languages. Over the years, however, researchers noticed other difficulties to be overcome in translation (such as the choice of a mistaken word or expression), and humans had to be hired to revise the texts after this process. At that point, they realized that many of the problems were linguistic in nature and could only be overcome if language processing relied on a detailed and robust description of language, using language itself.
Thus, some years later, with research in NLP and the advance of descriptive linguistic theories, it became possible to automatically execute “linguistic tasks” with more precision and correctness. This is where grammar checkers and speech recognizers and synthesizers, which are very common in smartphones today, come from, as well as Automatic Summarization (AS) systems. The purpose of AS systems is to produce a reduced, coherent, cohesive and at the same time informative and generic version (generic in the sense of not having a specific target audience) of one or more written source texts.
Nowadays, most of the information sources we consult are online, where the availability and circulation of digital information have increased considerably. To give an idea, a report published by the Cisco Visual Networking Index projects that in 2021 the production of information on the Web will reach 3.3 zettabytes!
Even with the virtual space as a motivation, AS systems face a complication: given the large number of newspapers, blogs and social network posts being produced, there is almost never a single publication or news item about a specific event. To illustrate, we ran an online search for “truckers strike”, restricted to news texts from a one-year period. The search returned 62,900 results; that is, approximately 63,000 news items circulated on the Web in the last year about the truckers' strike that happened in Brazil in 2018. Researchers in AS point out that the mismatch between the large amount of information available and the short time the user has to process it is the main motivation for studies in this area. They further propose that AS can be performed either by only selecting, cutting and rearranging the sentences of the source texts, or by selecting the sentences and rewriting them in other words.
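To make the first of these two strategies (selecting and rearranging existing sentences) more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: scoring sentences by average word frequency is a simple, well-known heuristic swapped in for the example, not the method used by the researchers mentioned here, and the function names `split_sentences` and `extractive_summary` are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, max_sentences=2):
    """Select the highest-scoring sentences and return them in source order.

    Assumption: a sentence is 'important' if its words are frequent in the
    whole text. Real AS systems rely on much richer linguistic information.
    """
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence positions by score, keep the best ones...
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # ...then restore the original order so the summary still reads naturally.
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Nothing is rewritten in this sketch: every sentence of the output is copied verbatim from the input, which is what distinguishes this strategy from the one that rewrites the chosen sentences.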
In Table 1[1] we show excerpts from two (online) news texts. These fragments report on the sixth day of the strike organized by truck drivers in Brazil in 2018. To make them easier to analyze, the sentences (S) of the texts were numbered, ignoring the paragraph organization, resulting in 4 sentences in each text and 180 words in total.
From Table 1, we can observe some linguistic phenomena: between Sentences 1 and 2 of Text A and Sentence 1 of Text B, there is redundancy (or similarity) of content (such as the date on which the movement began) and complementarity (such as additional details). To produce an automatic summary, the AS system must identify these and other possible relationships between sentences of the source texts, based on the linguistic information they contain. In the cited example, redundancy is characterized by important words shared between the two sentences (such as the nouns “day” and “country”), whereas complementarity is characterized by information in Text B that is not present in Text A, such as the number of blocked points on the highways.
The role of the linguist in this initial process is to identify these and other phenomena (such as contradiction and variation in writing style) and then to list the features that signal their occurrence, so that computer systems can learn to recognize these relationships, for example “if two sentences share words, the relationship is redundancy”. This will make it possible to automate summarization later. Table 2 illustrates a summary synthesized from Texts A and B from Table 1.
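The word-overlap rule quoted above can be sketched in a few lines of Python. This is a simplified illustration, not an actual AS implementation: the tiny stopword list, the Jaccard overlap measure, the 0.5 threshold and the example sentences are all assumptions made up for this sketch (the sentences are not those of Table 1).

```python
import re

# Hypothetical, deliberately tiny stopword list; a real system would use a
# fuller one (or part-of-speech information to keep only nouns, for instance).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "its",
             "is", "are", "was", "were", "by", "for"}


def content_words(sentence):
    """Lower-case the sentence and keep the words that are not stopwords."""
    return {w for w in re.findall(r"\w+", sentence.lower())
            if w not in STOPWORDS}


def overlap(sentence_a, sentence_b):
    """Jaccard overlap: shared content words divided by all content words."""
    a, b = content_words(sentence_a), content_words(sentence_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_redundant(sentence_a, sentence_b, threshold=0.5):
    """Assumed rule: two sentences are redundant if enough words are shared."""
    return overlap(sentence_a, sentence_b) >= threshold


# Invented example sentences, for illustration only.
s1 = "The truck drivers' strike reached its sixth day in the country."
s2 = "The strike by truck drivers entered the sixth day across the country."
s3 = "Fuel prices are set by the government."

print(is_redundant(s1, s2))  # True: six shared content words out of nine
print(is_redundant(s1, s3))  # False: no content words in common
```

A rule this simple detects only redundancy. Complementarity and contradiction need finer-grained features, which is why the linguistic description matters.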
In the summary of Table 2, we selected the sentences that could represent the subject of the source texts, so as to avoid redundancy and contradiction and to emphasize the complementarity between the sentences. As a result, we have a text consisting of 70 words and only 3 sentences; in relation to the source texts, the summary keeps around 38% of the words and sentences. The “cut” made to the original texts, a little more than 60% of their content, characterizes the compression ratio, that is, the amount of information that the user of the AS system wants left out of the summary.
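The arithmetic behind these figures can be checked with a short Python sketch. It assumes that the compression ratio is measured as the fraction of source words left out of the summary; the function name `compression_ratio` is hypothetical.

```python
def compression_ratio(source_words, summary_words):
    """Fraction of the source that is left out of the summary."""
    if source_words <= 0:
        raise ValueError("source_words must be positive")
    return 1 - summary_words / source_words


# Counts taken from the text: two source texts of 4 sentences each,
# 180 words in total; the summary has 3 sentences and 70 words.
source_words, summary_words = 180, 70
source_sentences, summary_sentences = 8, 3

kept_words = summary_words / source_words
kept_sentences = summary_sentences / source_sentences

print(f"words kept:        {kept_words:.1%}")       # 38.9%
print(f"sentences kept:    {kept_sentences:.1%}")   # 37.5%
print(f"compression ratio: "
      f"{compression_ratio(source_words, summary_words):.1%}")  # 61.1%
```

The share of text kept and the compression ratio always add up to 100%, so a summary that keeps about 38% of the words corresponds to a cut of about 61%.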
AS systems still need to consider the flow of information between the sentences of the source texts: imagine if the last sentence of Text B were the first sentence of the summary. What a mess it would be! Thus, another task of the researcher is to evaluate the linguistic quality of the final summaries, analyzing the coherence, cohesion and informativeness of the text. If we identify operational errors, it will be necessary to review each of the steps and, possibly, to improve the linguistic descriptions before they are implemented in the AS system.
The future of research on AS for Portuguese, especially the work developed by the Inter-Institutional Center for Computational Linguistics (NILC), headquartered at the University of São Paulo (USP-São Carlos), is moving towards the other type of summarization mentioned earlier in this text: the rewriting of sentences. Considering the whole pipeline of systems of this kind, it will be necessary to add another step to summarization: automatically predicting and rewriting the sentences chosen for the summary. However, for this research to be developed, it will be important to study human behavior in summarizing texts further and, consequently, to produce more detailed linguistic descriptions of this behavior.
[1] To present this example in English, we translated the original texts, but we have kept the link to the original texts in Portuguese.