Posted by: juditt | April 16, 2008

Concepts of the World of Translation

The world of translation has a specific vocabulary, and knowing it is very useful for understanding the field's most important concepts. Some of these concepts are the following:

  • Machine Translation (MT): according to Wikipedia, it is a subfield of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. At its most basic level, MT performs simple substitution of words in one natural language for words in another. Using corpus techniques, more complex translations may be attempted, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms, as well as the isolation of anomalies. Output quality can also be improved by human intervention: for example, some systems translate more accurately if the user has unambiguously identified which words in the text are names. With the assistance of these techniques, MT has proven useful as a tool to assist human translators, and in some cases it can even produce output that can be used "as is". However, current systems are unable to produce output of the same quality as a human translator, particularly where the text to be translated uses casual language.
  • Computer-assisted translation (CAT): a form of translation in which a human translator translates texts using computer software designed to support and facilitate the translation process. It is sometimes called machine-assisted or machine-aided translation. Some advanced CAT solutions include controlled machine translation (MT), and this integration has been implemented in various ways by various parties.
  • Multilingual content management: a multilingual website is usually a mixture of global and local content. Local content presents no particular content management issues; global content – which has to be translated across all language locales – does. Deciding where multiple language versions of content are going to be required and where content can be maintained separately for different locales is a critical decision that will affect how a site should be maintained and what it will cost.
  • Translation technology: translation is the interpretation of the meaning of a text and the subsequent production of an equivalent text, also called a translation, that communicates the same message in another language. The text to be translated is called the "source text", the language it is to be translated into is called the "target language", and the final product is sometimes called the "target text".
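The word-for-word substitution mentioned under Machine Translation, and the benefit of marking names, can be illustrated with a minimal sketch. The glossary, the function name `substitute_words` and the `names` parameter are all hypothetical and made up for this example; no real MT system is this simple.

```python
# Toy illustration only: GLOSSARY is a hypothetical, hand-made
# Spanish-to-English word list, not data from any real MT system.
GLOSSARY = {
    "el": "the",
    "gato": "cat",
    "negro": "black",
    "come": "eats",
    "pescado": "fish",
    "rosa": "pink",
}


def substitute_words(sentence, glossary, names=()):
    """Translate word by word, keeping the source word order.

    Words listed in `names` are copied unchanged; words missing from the
    glossary are also passed through unchanged.
    """
    output = []
    for word in sentence.split():
        if word in names:
            output.append(word)  # user-marked proper name: do not translate
        else:
            output.append(glossary.get(word.lower(), word))
    return " ".join(output)


# Word order is not adapted, so the adjective stays after the noun.
print(substitute_words("El gato negro come pescado", GLOSSARY))

# Without a hint, the name "Rosa" is translated as the colour.
print(substitute_words("Rosa come pescado", GLOSSARY))

# With the name marked by the user, it is preserved.
print(substitute_words("Rosa come pescado", GLOSSARY, names={"Rosa"}))
```

The first call yields "the cat black eats fish", which shows why plain substitution cannot cope with differences in word order between languages.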

Sources:

Posted by: juditt | April 9, 2008

FEMTI report: characteristics of a translation task

FEMTI (Framework for the Evaluation of Machine Translation) is a resource that helps MT evaluators define contextual evaluation plans, and it consists of two interrelated classifications:

  1. One of them lists the possible characteristics of the contexts of use that can be applied to MT systems.
  2. The other one lists the possible characteristics of an MT system, along with the metrics that were proposed to measure them.

To use FEMTI, evaluators first specify the intended context of use. FEMTI then draws on its embedded knowledge base to propose a set of quality characteristics that are relevant to that context. Evaluators can modify this set of characteristics and choose an evaluation metric for each of them by browsing the second classification. Finally, they can print the evaluation plan and execute it.
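The workflow above can be pictured as two lookup tables and a function that joins them. This is only a sketch: the characteristic names, metric names and the `build_plan` function are hypothetical placeholders chosen for illustration, not entries or an interface taken from FEMTI.

```python
# Hypothetical sketch: the characteristics and metrics below are
# illustrative placeholders, not entries copied from FEMTI.

# Classification 1 (simplified): context of use -> suggested characteristics.
SUGGESTED = {
    "assimilation": ["comprehensibility", "speed"],
    "dissemination": ["fidelity", "fluency"],
    "communication": ["fluency", "speed"],
}

# Classification 2 (simplified): characteristic -> metrics proposed for it.
METRICS = {
    "comprehensibility": ["reader comprehension test"],
    "fidelity": ["human adequacy rating", "back-translation check"],
    "fluency": ["human fluency rating"],
    "speed": ["words translated per minute"],
}


def build_plan(context, choices=None):
    """Return {characteristic: metric} for the given context of use.

    `choices` lets the evaluator pick a metric per characteristic;
    otherwise the first metric listed for that characteristic is used.
    """
    choices = choices or {}
    plan = {}
    for characteristic in SUGGESTED[context]:
        metric = choices.get(characteristic, METRICS[characteristic][0])
        if metric not in METRICS[characteristic]:
            raise ValueError(f"unknown metric for {characteristic}: {metric}")
        plan[characteristic] = metric
    return plan


# "Print the evaluation plan" for a dissemination task.
plan = build_plan("dissemination", {"fidelity": "back-translation check"})
for characteristic, metric in plan.items():
    print(f"{characteristic}: {metric}")
```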

According to the FEMTI report, a translation task has specific characteristics. This refers to the information flow intended for the output, from the point of view of the agent (human or otherwise) who receives the translation. As J.C. Sager noted for Machine Translation systems, two types of use [are] to be considered: (a) the un-edited output; (b) the edited output. The output may be acceptable for either use or both and the evaluation should determine this. In the case of edited output the cost of revision, editing etc. has to be established and compared with the cost of manual translation. Since the type of use is related to the type of text, these types have to be established and taken into account.

On the other hand, in "Toward Finely Differentiated Evaluation Metrics for Machine Translation", Hovy suggests dividing all possible translation tasks into three main groups. He argues that, to make the taxonomy of features useful to people who do not already know much about MT and do not wish to become experts in evaluation, its layers and choices must be articulated in terms they can intuitively understand. This part of the evaluation taxonomy describes three principal types of use in such a way that users can identify the particular type of work they want done, while developers can define in strict terms what their MT system can do.

Apart from that, the main characteristics of a translation task according to the FEMTI report are the following:

  • Assimilation: its purpose is to monitor a relatively large volume of texts produced by people outside the organization, in several languages. It includes document routing or sorting, information extraction or summarization, and search.
  • Dissemination: its purpose is to deliver to others a translation of documents produced inside the organization. It includes internal dissemination (routine and experimental) and external dissemination or publication (single-client and multi-client).
  • Communication: its aim is to support multi-turn dialogues between people who speak different languages. The translation quality must be high enough for good, fluent conversation. It includes synchronous (interactive) communication and asynchronous (delayed) communication.

Sources:

Posted by: juditt | April 1, 2008

Stanford NLP Group’s research topics

Human language technologies cover many research topics. For example, the members of the Stanford NLP Group pursue research on the following topics:

  • Computational Semantics: this consists of extracting meaning representations from texts, ranging from shallow representations such as named entities and thematic roles to deeper structures such as quantification in logic. Among other things, it includes Named Entity Recognition (NER) and Information Extraction (IE). The group has worked on a wide range of NER and IE related issues and has used a Maximum Entropy Markov Model. In 2003 they addressed NER in the domain of biomedical papers, which required identifying genes and proteins but not distinguishing between the two.
  • Parsing and Tagging: this consists of algorithmically assigning parts of speech and syntactic structure, with an emphasis on probabilistic and discriminative models. It includes Probabilistic Parsing and Part-of-speech Tagging, which assigns the correct part of speech (verb, adjective, noun, etc.) to each word. The group has worked on building probabilistic conditional log-linear models for tagging, and the resulting tagger can be considered among the best available for English.
  • Multilingual NLP: this covers a variety of NLP investigations in Chinese, Arabic and German, including tagging, segmentation, probabilistic syntactic parsing, and semantic role parsing. Within this group, we must distinguish between Arabic, German and Chinese NLP. The Chinese strand does research in Chinese Natural Language Processing and includes word segmentation, part-of-speech tagging, syntactic and semantic parsing, and so on.
  • Unsupervised Induction of Linguistic Structure: this area studies probabilistic and other corpus-based algorithms that aim to learn syntactic, morphological and phonological structure from corpora. It includes Grammar Induction (the more linguistic structure that can be learned, the less need there is for large marked-up corpora), Semantic Taxonomy Induction (focused on acquiring the basic semantic relationships between words from corpora), and Morphology and Phonology Induction.
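To make the idea of a conditional log-linear model for part-of-speech tagging (from the Parsing and Tagging item above) more concrete, here is a minimal sketch. It is not the Stanford tagger: the tag set, the features, the hand-set weights and the greedy left-to-right decoding are all assumptions made for illustration. A real tagger would learn its weights from an annotated corpus.

```python
import math

TAGS = ["DET", "NOUN", "VERB"]

# Hypothetical hand-set weights for (feature, tag) pairs.
# A real tagger would estimate these from annotated training data.
WEIGHTS = {
    ("word=the", "DET"): 3.0,
    ("prev=DET", "NOUN"): 2.0,
    ("prev=NOUN", "VERB"): 2.0,
    ("suffix=s", "VERB"): 1.0,
    ("suffix=s", "NOUN"): 0.5,
}


def features(word, prev_tag):
    """Describe the tagging decision with a few simple indicator features."""
    return [f"word={word}", f"prev={prev_tag}", f"suffix={word[-1]}"]


def tag_probabilities(word, prev_tag):
    """Log-linear model: P(tag | context) is proportional to exp(sum of weights)."""
    scores = {
        tag: sum(WEIGHTS.get((feat, tag), 0.0) for feat in features(word, prev_tag))
        for tag in TAGS
    }
    normalizer = sum(math.exp(score) for score in scores.values())
    return {tag: math.exp(score) / normalizer for tag, score in scores.items()}


def greedy_tag(words):
    """Tag left to right, always keeping the most probable tag (a simplification)."""
    prev_tag = "START"
    tags = []
    for word in words:
        probs = tag_probabilities(word.lower(), prev_tag)
        prev_tag = max(probs, key=probs.get)
        tags.append(prev_tag)
    return tags


print(greedy_tag(["The", "dog", "barks"]))
```

With these weights the sentence is tagged DET, NOUN, VERB. The word "barks" is resolved as a verb because the previous-tag feature outweighs the ambiguous "-s" suffix.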

Sources:
