How Do We Teach Machines To Understand The Images They See?

Our society is more technologically advanced than ever. We've been able to send humans to the moon, make phones that talk to us, and design custom radio stations that play only the music we like.

However, our most advanced machines and computers still have difficulty understanding images. Good progress has been made in this area, though, especially in machine vision, which may revolutionize the world of technology.

To date, scientists have been able to build cars that drive on their own, but without intelligent vision, those cars can't tell the difference between a crumpled paper bag on the road, which can safely be driven over, and a rock, which cannot.

To solve this problem, cameras with excellent megapixel resolution were created, but they require special software to be used effectively. Drones fly over large areas, but they lack the visual technology to help us track changes in the rainforest.

Security cameras are everywhere, but they do not warn us when a child is drowning in a pool. Images and videos are becoming an essential part of life worldwide, produced at speeds far beyond what any human, or group of humans, could hope to watch.

You may ask: why is this so difficult?

Cameras produce images by converting light into a two-dimensional array of numbers called pixels: lifeless numbers with no inherent meaning. Just as hearing is not the same as listening, taking a picture is not the same as seeing, because seeing requires understanding.
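
To make this concrete, here is what a photograph looks like to a computer, sketched in Python with Pillow and NumPy (assumed libraries; cat.jpg is a hypothetical file):

```python
from PIL import Image
import numpy as np

# Load a photo (hypothetical file) and convert it to an array of numbers.
image = Image.open("cat.jpg")
pixels = np.asarray(image)

# To the computer, the picture is just a grid of numbers:
# height x width x 3 color channels (red, green, blue), each value 0-255.
print(pixels.shape)   # e.g. (480, 640, 3)
print(pixels[0, 0])   # the top-left pixel, e.g. [142 117  98]
```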

The human brain, like that of other creatures, evolved over millions of years to process the images it sees. To be precise, seeing begins with the eye, but the main work happens in the brain.

The first step on this path is to teach computers to recognize objects, to familiarize them with the things of the visual world. Imagine this training process as showing the computer many training photos of a particular object, such as cats, and designing a model that learns from seeing those photos.
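
As a minimal sketch of that training process, assuming PyTorch and torchvision (the article names no framework) and a hypothetical photos/ folder with one subfolder per object class:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# Labeled training photos arranged in folders, e.g. photos/cat/..., photos/dog/...
# (a hypothetical directory layout for illustration).
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_data = datasets.ImageFolder("photos", transform=transform)
loader = DataLoader(train_data, batch_size=32, shuffle=True)

# A small off-the-shelf model, sized to our number of classes.
model = models.resnet18(num_classes=len(train_data.classes))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# "Showing the computer training photos": each pass compares the model's
# guess with the true label and adjusts the model to reduce the error.
for epoch in range(10):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```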

How hard can this be? After all, a cat is just a collection of shapes and colors, and that is how object modeling worked in its early days: you tell the algorithm, in mathematical language, that a cat has a round face, a plump body, two pointed ears, and a long tail. That might be enough for one view, but what if the cat is seen from a different angle?

Now you have to add another shape and angle to the object model. But what if the cat is curled up, half hidden, or simply doesn't match the template?

Now you see what I mean about how complicated this process is.

No one tells a child how to see, especially in the early years when they begin to understand the world around them. Children learn this through real-world experience and examples.

If you think of a child's eyes as a pair of biological cameras, they take a picture roughly every 200 milliseconds, the average time it takes the eye to move. So by the age of three, a child has seen hundreds of millions of images of the natural world: an enormous volume of training examples.
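
A rough back-of-the-envelope check of that figure, assuming about twelve waking hours a day (my assumption, not the author's):

```python
# One "picture" every 200 ms = 5 per second.
images_per_second = 1000 / 200

# Assume roughly 12 waking hours per day (an assumption for illustration).
seconds_awake_per_day = 12 * 3600
days = 3 * 365

total = images_per_second * seconds_awake_per_day * days
print(f"{total:,.0f}")  # ~236,520,000 -- hundreds of millions of images
```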

So instead of focusing only on better algorithms, you should pay attention to giving the algorithms that kind of training data, treating them just like a child.

Once you understand this, the next step is to gather the data you need.

The more images you have, the better the algorithm's output. The best resource here is the internet, which hosts the largest treasure trove of photographs humans have ever taken. In some cases, experts have used a billion images, relying on crowdsourcing and platforms such as Amazon Mechanical Turk to tag them.

To do this, you need a large team that can organize, modify, sort, and tag the images. Interestingly, all of this parallels what a child's mind does in its early years of development. With hindsight, the idea of using vast amounts of data to teach computer algorithms seems obvious.

For example, in a 2009 study, the ImageNet project gave artificial intelligence researchers a database of 15 million images spanning 22,000 classes of objects, organized by everyday English words.

In quality and quantity, this scale was unprecedented. In the case of cats, for instance, there were more than 62,000 images, covering every shape, pose, and body type, and both domestic and wild species.

Now that you have the data to power the computer's brain, you are ready to return to the algorithms themselves, because projects like ImageNet provide the categorized, complete information needed to train machine learning algorithms.

Suppose such data is at your disposal. The brain is made up of billions of interconnected neurons, and the fundamental operating unit in a neural network is a node modeled loosely on a neuron.

A node takes input from other nodes and sends its output on to still other nodes.

In addition, these hundreds, thousands, or even millions of nodes are arranged in hierarchical layers, much like in the brain. A typical neural network used to train object recognition models has twenty-four million nodes, 140 million parameters, and 15 billion connections.

It is a considerable model. Using the enormous ImageNet data and the processing power of GPUs, a convolutional neural network can be trained at this scale.
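
To make "layers of nodes" concrete, here is a toy convolutional network in PyTorch (an assumed framework; this sketch is vastly smaller than the 140-million-parameter networks described above):

```python
import torch.nn as nn

# A toy convolutional network: simple units stacked in hierarchical layers,
# early layers seeing raw pixels, later layers increasingly abstract features.
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # detect edges and colors
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # detect simple shapes
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Final layer maps the learned features to class scores
        # (the size assumes 224x224 input images).
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))
```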

After training, the network can look at a photo among millions and tell you not only that the image contains a cat but where in the picture the cat is. Of course, such networks go far beyond recognizing cats: a computer algorithm can tell you that a picture of a boy and a teddy bear also contains a dog, a man, and a small kite in the background, or describe a more crowded scene with a man, skateboards, railings, light poles, and more.

Sometimes, when computers are unsure of what they are looking at, you have to help them be smart enough to give a safe answer, just as humans do.
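
One common way to do this is to threshold the model's confidence; a minimal sketch (the 0.6 cutoff and the class names are illustrative assumptions, not from the original):

```python
import torch

def safe_label(logits: torch.Tensor, classes: list[str], threshold: float = 0.6) -> str:
    """Return the predicted class, or a cautious fallback when the model is unsure."""
    probs = torch.softmax(logits, dim=-1)   # turn raw scores into probabilities
    confidence, index = probs.max(dim=-1)
    if confidence.item() < threshold:
        return "not sure"                   # the safe answer, like a careful human
    return classes[index.item()]

# Example: a model that is only ~37% sure stays cautious.
print(safe_label(torch.tensor([0.2, 0.1, 0.3]), ["cat", "dog", "bird"]))  # "not sure"
```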

For example, if you feed the algorithm millions of street-view images from different cities, and the coding is done correctly, you get interesting information: the algorithm can show you how car prices relate to household income, how they relate to a city's crime rate, or how voting patterns vary by postal code.

So far, we have taught the computer to see objects. This is like a child learning to say a few nouns: an incredible success, but only the first step.

Soon a child takes an important step and learns to communicate by talking. So instead of saying "there is a cat in this picture," the algorithm should be capable of saying "a cat is lying on the bed."

To teach the computer to describe an image in sentences, you need to forge a new link between big data and the machine learning algorithm; that is the next step.

The computer now has to learn from both images and natural-language sentences written by humans.

Just as the brain combines vision and language, you must create a model that links parts of the image, such as regions containing visual objects, to the words and phrases of sentences.
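
A minimal sketch of such a link in PyTorch (a simplified encoder-decoder of my own devising, not the exact model the research used): a CNN condenses the image into a feature vector, and a recurrent decoder unrolls that vector into a sentence, one word at a time.

```python
import torch
import torch.nn as nn
from torchvision import models

class Captioner(nn.Module):
    """Link visual features to words: CNN encoder -> LSTM decoder."""
    def __init__(self, vocab_size: int, embed_dim: int = 256):
        super().__init__()
        # Encoder: a CNN summarizes the image as one feature vector.
        cnn = models.resnet18(weights=None)
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.encoder = cnn
        # Decoder: an LSTM emits the sentence one word at a time.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.to_word = nn.Linear(embed_dim, vocab_size)

    def forward(self, images, captions):
        img_feat = self.encoder(images).unsqueeze(1)   # (batch, 1, dim)
        words = self.embed(captions)                   # (batch, time, dim)
        # The image feature primes the LSTM as its first "word".
        hidden, _ = self.lstm(torch.cat([img_feat, words], dim=1))
        return self.to_word(hidden)                    # scores over the vocabulary
```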

In general, keep in mind that you first give machines vision, teaching them to see, and then they can help you better understand images.