
What is Natural Language Processing (NLP)?

Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interaction between computers and (natural) human languages, in particular with how to program computers to process and analyze large volumes of natural language data.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

History of natural language processing

In general, the history of natural language processing dates back to the 1950s, although some work can be traced to earlier periods. In 1950, Alan Turing published an article entitled “Computing Machinery and Intelligence,” which proposed what is now called the Turing test as a criterion of intelligence.

The Georgetown experiment in 1954 involved the fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three to five years machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation fell sharply. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted “blocks worlds” with restricted vocabularies, and ELIZA, a simulation of a person-centered psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the “patient” exceeded the very small knowledge base, ELIZA might give a generic response, for example, replying to “My head hurts” with “Why do you say your head hurts?”

During the 1970s, many programmers began writing “conceptual ontologies,” which incorporated real-world information into computer-understandable data structures.

Until the 1980s, most natural language processing systems were based on complex handwritten rules.

Starting in the late 1980s, however, the introduction of machine learning algorithms for language processing revolutionized natural language processing. This was due both to the steady increase in computational power (see Moore’s law) and to the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g., transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine learning approach to language processing.

Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to the existing handwritten rules. However, part-of-speech tagging introduced hidden Markov models to natural language processing, and research increasingly focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.

The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and they produce more reliable results when integrated into a larger system comprising multiple subtasks.

Many of the early breakthroughs in machine translation were due in particular to the work of IBM Research, which developed more sophisticated statistical models.

These systems were able to take advantage of existing multilingual textual corpora produced by the Parliament of Canada and the European Union as a result of the need to translate governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by those systems, which was (and often still is) a major limitation on their success. As a result, a great deal of research has gone into methods of learning effectively from limited amounts of data.

Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers, or from a combination of annotated and non-annotated data.

This is generally much more difficult than supervised learning, and for a given amount of input data it typically produces less accurate results. However, an enormous amount of non-annotated data is available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results, provided the algorithm used has a low enough time complexity to be practical.

In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, thanks in part to a flurry of results showing that such techniques can achieve state-of-the-art results in many natural language tasks.

Examples include language modeling, parsing, and many others. Popular techniques include the use of word embeddings to capture the semantic properties of words, and an increase in end-to-end learning of a higher-level task (e.g., answering a question) instead of relying on a pipeline of separate intermediate tasks (such as part-of-speech tagging and dependency parsing). In some areas, this shift has entailed such substantial changes in how NLP systems are designed that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as the word alignment and language modeling used in statistical machine translation (SMT).
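As an illustration of the word-embedding idea mentioned above, here is a minimal sketch using the gensim library (my choice; the post names no specific toolkit). A real model would be trained on millions of sentences rather than a toy corpus.

```python
# A minimal word-embedding sketch with gensim's Word2Vec
# (assumes gensim 4.x is installed: pip install gensim).
from gensim.models import Word2Vec

# Toy corpus; real embeddings are trained on very large corpora.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=42)

# Every word is now a dense, real-valued vector that captures
# distributional properties of the word.
print(model.wv["king"][:5])           # first five dimensions of "king"
print(model.wv.most_similar("king"))  # nearest neighbors by cosine similarity
```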

Rule-based NLP vs. Statistical NLP

In the early days, many language processing systems were designed by hand-coding a set of rules, such as by writing grammars or devising heuristic rules for stemming. However, such systems are rarely robust to natural language variation.

Since the famous “statistical revolution” of the late 1980s and mid-1990s, much natural language processing research has relied heavily on machine learning.

The machine learning paradigm, by contrast, calls for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples.

Many different classes of machine learning algorithms have been applied to natural language processing tasks. These algorithms take as input a large set of “features” generated from the input data. Some of the earliest algorithms, such as decision trees, produced rigid systems of if-then rules similar to the handwritten rule systems that were common at the time. Increasingly, however, research has focused on statistical methods, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage of being able to express the relative certainty of many possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.
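To make the contrast concrete, here is a minimal sketch of such a statistical model using scikit-learn’s logistic regression (my choice of library; the text prescribes none). The four-example dataset is invented purely for illustration.

```python
# A statistical NLP model that makes soft, probabilistic decisions by
# learning real-valued weights for word features (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great movie", "terrible movie", "great plot", "terrible acting"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (invented toy data)

vectorizer = CountVectorizer()   # each word becomes an input feature
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, labels)

# Unlike a hard if-then rule, the model reports its relative certainty
# about every possible answer.
probs = model.predict_proba(vectorizer.transform(["great acting"]))
print(probs)  # e.g. [[0.28 0.72]] -- a soft decision, not a hard rule
```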

Systems based on machine learning algorithms have several advantages over handwritten rules:

  • The learning procedures used in machine learning automatically focus on the most common cases, whereas when writing rules by hand it is often not at all obvious where the effort should be directed.
  • Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (for example, containing words or structures that have not been seen before) and to erroneous input. Generally, handling such input gracefully with handwritten rules, or building systems of handwritten rules that make soft decisions, is extremely difficult, error-prone, and time-consuming.
  • Systems based on automatically learning the rules can be made more accurate simply by supplying more input data. By contrast, systems based on handwritten rules can only be made more accurate by increasing the complexity of the rules, which is a much more difficult task. In particular, there is a limit to the complexity of systems based on handwritten rules, beyond which they become more and more unmanageable. Creating more data to feed machine learning systems, on the other hand, simply requires a corresponding increase in the number of hours of annotation work, generally without significant increases in the complexity of the annotation process.

Major tasks and evaluations

A list of the most commonly researched tasks in natural language processing follows. Note that some of these tasks have direct real-world applications, while others more commonly serve as subtasks that help solve larger problems. Although natural language processing tasks are closely intertwined, they are frequently subdivided into categories for convenience. A coarse categorization follows.

  • Syntax
  • Semantics
  • Speech

Syntax

  • Grammar induction
  • Lemmatization
  • Morphological segmentation
  • Part-of-speech tagging
  • Parsing

Grammar induction

Generate a formal grammar that describes a language’s syntax.
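Grammar induction aims to learn such a grammar automatically from data; the toy context-free grammar below is written by hand with NLTK (assuming it is installed) purely to illustrate what the target representation looks like.

```python
# A toy context-free grammar of the kind grammar induction tries to
# produce automatically; written by hand here for illustration only.
import nltk

grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N
    VP  -> V
    Det -> 'the'
    N   -> 'dog'
    V   -> 'barks'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog barks".split()):
    print(tree)  # (S (NP (Det the) (N dog)) (VP (V barks)))
```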

Lemmatization

Remove inflectional endings only and return the base dictionary form of a word, known as the lemma.
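A minimal sketch with NLTK’s WordNet lemmatizer (assuming nltk and its WordNet data are available; any comparable library would do):

```python
# Lemmatization: strip inflectional endings and return the dictionary form.
import nltk
nltk.download("wordnet", quiet=True)  # fetch WordNet data on first run
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# The pos hint ("v" = verb, "n" = noun) tells the lemmatizer which
# dictionary form to look up.
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("geese", pos="n"))    # goose
```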

Morphological segmentation

Separate words into individual morphemes and identify the class of each morpheme. The difficulty of this task depends greatly on the complexity of the morphology (i.e., the structure of words) of the language being considered.

English has fairly simple morphology, especially inflectional morphology, so it is often possible to skip this task entirely and simply model all possible forms of a word (e.g., “open, opens, opened, opening”) as separate words. In languages such as Turkish or Manipuri, a highly agglutinative Indian language, however, such an approach is not possible, as each dictionary entry can have thousands of possible word forms.
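Real morphological segmenters are trained statistically; the toy function below only strips a few English inflectional suffixes, purely to show the input and output of the task.

```python
# A toy morphological segmenter: split a word into stem + suffix.
# Real systems learn segmentations from data; this is illustration only.
SUFFIXES = ["ing", "ed", "s"]

def segment(word):
    """Return a crude [stem, suffix] split if a known suffix matches."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return [word[: -len(suffix)], suffix]
    return [word]

print(segment("opening"))  # ['open', 'ing']
print(segment("opens"))    # ['open', 's']
print(segment("open"))     # ['open']
```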

Part-of-speech tagging

Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, “book” can be a noun (“the book on the table”) or a verb (“to book a flight”); “set” can be a noun, verb, or adjective; and “out” can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is also prone to it, because it is a tonal language during verbification, and such inflection is not easily conveyed by the entities the orthography uses to convey the intended meaning.
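A small sketch with NLTK’s default tagger (assuming nltk and its tokenizer and tagger models are downloaded), reusing the ambiguous word “book”:

```python
# Part-of-speech tagging with NLTK's default tagger.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["Book a flight to Paris.", "The book is on the table."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# The tagger should resolve "book" differently in each context:
# a verb (VB) in the first sentence, a noun (NN) in the second.
```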

Parsing

Determine the parse tree (grammatical analysis) of a given sentence. The grammar of natural languages is ambiguous, and typical sentences have multiple possible analyses. Perhaps surprisingly, a typical sentence may have thousands of potential parses (most of which, of course, will seem completely nonsensical to a human).

There are two primary types of parsing: dependency parsing and constituency parsing. Dependency parsing focuses on the relationships between words in a sentence (marking things like primary objects and predicates), whereas constituency parsing focuses on building out the parse tree using a probabilistic context-free grammar (PCFG). (See also stochastic grammar.)
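A dependency-parsing sketch using spaCy (my choice; this assumes spaCy and its small English model, fetched with `python -m spacy download en_core_web_sm`):

```python
# Dependency parsing: each token points to its syntactic head
# via a labeled dependency arc.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog")

for token in doc:
    print(f"{token.text:6} --{token.dep_}--> {token.head.text}")
# e.g. "fox --nsubj--> jumps" marks "fox" as the subject of "jumps".
```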

Semantics

  • Lexical semantics
  • Distributional semantics
  • Machine translation
  • Named entity recognition (NER)

Lexical semantics

What is the computational meaning of individual words in the context of a text?
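WordNet’s sense inventory (queried here via NLTK, assuming its WordNet data is available) shows why this question is hard: even a simple word has many candidate senses.

```python
# Lexical semantics: a single word maps to many candidate senses.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

# Each synset is one possible meaning of "book" in context.
for synset in wn.synsets("book")[:4]:
    print(synset.name(), "-", synset.definition())
```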

Distributional semantics

How can we derive semantic representations from data?
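One classic answer is to count which words occur near which: words with similar neighbor distributions get similar representations. A toy plain-Python sketch:

```python
# Distributional semantics via co-occurrence counts in a +/-1 word window.
from collections import Counter, defaultdict

corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]

cooc = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in (i - 1, i + 1):      # immediate left and right neighbors
            if 0 <= j < len(words):
                cooc[word][words[j]] += 1

# "cat" and "dog" share neighbors ("the", "drinks"), so their
# distributional representations come out similar.
print(cooc["cat"])
print(cooc["dog"])
```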

Machine translation

Automatically translate text from one human language to another. This is one of the most difficult problems, and it belongs to a class of problems informally termed “AI-complete,” i.e., problems that require all of the different types of knowledge humans possess (grammar, semantics, facts about the real world, etc.) in order to be solved properly.
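In practice, a pretrained neural translation model can be invoked in a few lines. A sketch using the Hugging Face transformers library and a public English-to-French model (both assumptions on my part; the post names no toolkit):

```python
# Neural machine translation with a pretrained model; the model weights
# are downloaded automatically on first use.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Natural language processing is fascinating.")
print(result[0]["translation_text"])
```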

Named entity recognition

Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g., person, location, organization). Note that although capitalization can aid in recognizing named entities in languages such as English, this information cannot help in determining the type of named entity, and in any case is often inaccurate or insufficient.

For example, the first letter of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (such as Chinese or Arabic) have no capitalization at all, and even languages with capitalization may not use it consistently to distinguish names. For example, German capitalizes all nouns regardless of whether they are names, and French and Spanish do not capitalize names that serve as adjectives.
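A named entity recognition sketch with spaCy (same assumed en_core_web_sm model as in the parsing example above):

```python
# Named entity recognition: spaCy marks spans and assigns entity types.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alan Turing worked in Cambridge for the British government.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# e.g. "Alan Turing -> PERSON", "Cambridge -> GPE" (geopolitical entity)
```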

Speech

Speech recognition

Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text to speech and is one of the extremely difficult problems colloquially termed “AI-complete” (see above). In natural speech there are hardly any pauses between successive words, so speech segmentation is a necessary subtask of speech recognition (see below). Note also that in most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so converting the analog (continuous) signal to discrete characters can be very difficult. Also, given that people with different accents pronounce the words of a language differently, speech recognition software must be able to recognize a wide variety of inputs as identical in terms of their textual equivalent.
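A minimal sketch with the SpeechRecognition package (an assumption; `clip.wav` below is a placeholder audio file you would supply):

```python
# Speech recognition: transcribe an audio file to text.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("clip.wav") as source:  # hypothetical placeholder file
    audio = recognizer.record(source)     # read the entire clip

# Sends the audio to Google's free web API; raises an error if the
# speech cannot be transcribed.
print(recognizer.recognize_google(audio))
```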

Speech segmentation

Given a sound clip of a person or people speaking, separate it into words. This is a subtask of speech recognition and is typically grouped with it.

Text to speech

Given a text, transform its units and produce a spoken representation. Text-to-speech can be used to aid people with impaired vision.
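A minimal sketch with the pyttsx3 package (an assumption; it drives the operating system’s built-in speech engine offline):

```python
# Text to speech: speak a string aloud through the OS speech engine.
import pyttsx3

engine = pyttsx3.init()
engine.say("Text to speech can aid people with impaired vision.")
engine.runAndWait()  # blocks until the utterance finishes
```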

Dialogue

In 2018, 1 the Road, the first work in this field created by an artificial intelligence, was published and marketed as a novel; it contains sixty million words.

Source: https://mediasoft.ir/%d9%be%d8%b1%d8%af%d8%a7%d8%b2%d8%b4-%d8%b2%d8%a8%d8%a7%d9%86-%d8%b7%d8%a8%db%8c%d8%b9%db%8c-natural-language-processing-%da%86%db%8c%d8%b3%d8%aa%d8%9f/