5

I’m Sorry to Say, But Your Understanding of Image Processing Fundamentals Is Absolutely Wrong

Emanuel Diamant
VIDIA-mant, Kiriat Ono
Israel

1. Introduction

Among the five human senses through which we explore our surroundings, vision takes a unique and remarkable place. The lion’s share of information about our near, medium, and distant environment comes to us via the vision channel. It is, therefore, not surprising that almost half of our cortex is devoted to visual information processing (Milner & Goodale, 1998). In the course of millions of years of evolution, we have even developed a very special attitude to it: we feel an everlasting “hunger” for new visual information. We are “Infovores”, as Irving Biederman (Biederman & Vessel, 2006), one of the founders of contemporary vision theory, wittily put it.

Maybe this perpetual yearning is the incentive that made us so inclined to various forms of visual information gathering and accumulation. The explosive expansion of camera phones is a good example: by the end of 2007, Nokia had managed to sell almost 440 million mobile phones (obviously, each one equipped with a tiny video camera), which accounted for 40% of all global mobile phone sales (Nokia, 2008). That means that more than a billion mobile phones were sold worldwide in that single year! By late 2009, the total number of camera phones will exceed that of both conventional and digital cameras shipped since the invention of photography (Thevenin et al., 2008).

The result is an unprecedented flood of visual information in our environment. According to a leading market research firm, Internet video consumption has increased by nearly 100% over the past year: from an average of 700 terabytes/day in 2006 to 1200 terabytes/day in 2007. Internet video uploads reached 500K uploads/day in 2007 and will grow to 4800K in 2011 (Mobile video, 2008).
That places an urgent demand for a new way of handling and managing the flow of visual information. Certainly, it must be human-like and human-compatible, because the Human Visual System (HVS) is the sole information processing system we know that is capable of coping with such problems. However, by saying that we immediately fall into a trap: we don’t know how the HVS so perfectly performs its duties. What we do know is that video data sampled by 126 million photoreceptors in the eye’s retina is immediately converted (as the visual input propagates from the eyes to the higher brain processing levels) into meaningful disjoint visual objects of various complexities. It must be stressed again and again: we do not know how this semantic segmentation is accomplished. But we certainly know that the bulk of visual processing accomplished in the human brain is performed at the semantic information processing level.

Frontiers in Brain, Vision and AI

Artificial visual systems that we have tirelessly attempted to construct over the last half century have always lacked such an ability. The bulk of visual processing carried out in artificial visual systems is constrained to visual data processing only: pure, exhaustive data processing and nothing more than that.

The apparent difference and incompatibility between these two image processing modalities (pure low-level data processing in human-made visual systems and enigmatic high-level semantic information processing in natural human visual systems) is often overlooked and commonly misinterpreted in the computer vision community. This leads to many funny things that are ubiquitous in computer vision design practice, but they seem far less funny when the production scale of such lapses is considered. Here are some examples.

The perceptual quality of an image is usually strictly tied to the image’s primary resolution: the more pixels in a frame, the more valued the image.
Undeniably, this philosophy is the driving force behind the race for multi-megapixel image sensors in phone cameras, or the High Definition Television standard for stationary devices. In each case, high image resolution is directly associated with an extremely high volume of raw image data. Communication bandwidth constraints, limited available power and other design restrictions demand effective signal compression techniques in such data-abundant cases. Indeed, carefully designed and skillfully adjusted compression/decompression (encoding/decoding) techniques are generally implemented. Their prime and single purpose is to reduce the data-handling burden. But in the end, the compressed/decompressed image data is always presented again to a human observer for final treatment and semantic information processing.

A smart design approach would attempt from the very beginning to encode the semantic objects buried in the image data and to deliver only them to the human observer. That is exactly what the MPEG-4 Standard designers had in mind when they introduced the standard’s innovative features: VO (Video Object), VOP (Video Object Plane), VOL (Video Object Layer). That happened in 1994 (Puri & Eleftheriadis, 1998), and expectations for the new video codec were very high. However, as time passed, nothing came of it, and for a very simple and sad reason: a visual object is a semantic entity, which cannot be attained by data manipulations. The standard’s designers were aware of this problem, and for this reason nothing was said about the way the visual objects have to be discovered and delineated. Hence, all further improvements and modifications the standard went through (and there were a lot of them; the last version of the standard is even named differently: H.264 or MPEG-4 Advanced Video Coding (H.264/AVC)) are concerned only with data coding improvements (Sullivan & Wiegand, 2005).
The consequences of this are easy to imagine: for stationary environments, where power dissipation, processing speed limitations and cost restrictions are not a concern, extremely powerful DSPs (Digital Signal Processors) like the Analog Devices TigerSHARC ADSP-TS201S with 3.6 GFLOPs of processing power are put to work. For those who are not satisfied with such might, BittWare offers a PCI Mezzanine Card featuring four TigerSHARCs on a single board with a total processing power of 57 GFLOPs (Bittware, 2007).

For mobile applications, where the restrictions are stern and fixed, the only possible solution is to compromise on image resolution (size). While the sensor resolution has steadily grown from 1.5 Megapixels (1280x1024) to 5 (2580x1930), 8 (3264x2444), 12 (4220x2820), and at last 14 Megapixels (4570x3050), the images actually handled by camera phones were of the size 80x60 pixels, or 160x120, or finally 352x288 pixels, which is the CIF (Common Intermediate Format) Standard. That is all that the infovore people can get in real-life circumstances.

Another field where vision technology is extensively used is video surveillance. Spurred by increasing public and private security concerns (especially after 9/11), the video surveillance systems market is experiencing unprecedented growth and expansion. Global video surveillance camera revenue is forecast to grow from $4.9 billion in 2006 to more than $9 billion by 2011 (Video surveillance, 2007). The general view is that the driving force behind this expansion is networked Internet Protocol (IP) video surveillance cameras and IP video servers. Indeed, IP technology provides the basis for a great leap in video surveillance systems design. However, it has a serious drawback: in terms of useful image resolution, the 352x288 CIF standard is the predominant one.
From the standpoint of a surveillance system user, the quality of an image in such a system is very dubious. But this peculiarity is not the focus of our concern here. Visual surveillance implies that the delivered picture is examined and analyzed for scene changes and suspicious event developments, so that appropriate countermeasures can be triggered in response. As already explained above, this is a purely semantic information-processing task that only a human being can perform, and none of the existing video surveillance systems can cope with such a task autonomously. What follows from this is that a human observer must be attached to the system’s display permanently: 24 hours a day, 7 days a week, 52 weeks a year. Otherwise the system is ineffective and useless. However, humans are the worst candidates for such monotonous and boring work. But who cares? To save on expenses, the observer’s display usually shows not a single camera output but is shared between 4, 8, or even 16 camera outputs. The effectiveness of such surveillance systems is illusory at best. The arrogant indifference to human/machine disparities in the handling of visual material triumphs again. And the market keeps on growing.

2. Tears do not solve problems

The urgent need for machine-based visual systems that are capable of processing visual information in a human-like intelligent manner is well understood and widely acknowledged today. The impressive research programs that the European Commission runs under its auspices are a good example of this understanding. But the scale of the efforts and the billions of euros put into the enterprise (European IST Research, 2006) cannot explain the lack of progress we witness during all phases of the projects’ development. We are now in the 7th Framework Programme (FP7), but nothing serious has happened, and things seem to be stalled in a dead-end alley.
This is a proper moment to check again the basic principles we adhere to when we pursue our routine research goals. Since we are aiming at human-like visual information processing, we first have to scrutinize the available knowledge about HVS performance and then analyse how this knowledge is used in modelling various human-like image-processing tasks.

The classical paradigm of human visual information processing was established a few decades ago by the seminal works of David Marr (Marr, 1978; Marr, 1982), Anne Treisman (Treisman & Gelade, 1980), Irving Biederman (Biederman, 1987), and a large group of their associates and followers. Treisman’s “Feature-integration theory” (Treisman & Gelade, 1980) is considered the most fitting incarnation of the idea. It regards human visual information processing as an interplay of two inversely directed processing streams. One is an unsupervised, bottom-up process that discovers and localizes the initial pieces of image information. The other is a supervised, top-down process, which conveys the rules and the knowledge that guide the linking and binding of these disjoint information pieces into perceptually meaningful image objects.

Essentially, as an idea, this conception was not entirely new. About two hundred years ago, Kant depicted the “faculty of (visual) apperception” as a “synthesis” of two constituents: the raw sensory data and the cognitive “faculty of reason” (Hanna, 2004). A century later, Hermann Ludwig Ferdinand von Helmholtz (the first to investigate our senses scientifically) reinforced this view, positing that sensory input and perceptual inferences are different, yet inseparable, faculties of human vision (Gregory, 1979). The novelty of the modern approach was the introduction of a new concept used to clarify the idea: “visual information” (Marr, 1978).
However, a suitable definition of the term was not provided, and the mainstream of relevant biological research has continued (and continues today) to investigate the puzzling duality of the phenomenon by capitalizing on traditional vague definitions of the matter: local and global image content, perceptual and cognitive image processing, low-level computer-derived image features versus high-level human-derived image semantics (Barsalou, 1999; Palmeri & Gauthier, 2004).

Putting aside the terminology, the main problem of human visual information processing remains the same: in order to accomplish the intuitively effortless agglomeration of low-level information pieces into meaningful semantic objects, the system has to be provided with some high-level knowledge about the rules of this agglomeration. Needless to say, such rules are usually not available. In biological vision research, this dilemma is known as the “binding problem”. Its importance was recognized at very early stages of vision research, and massive efforts have been directed at it in order to reach a suitable and acceptable solution. Despite the continuous efforts, no discernible success has been achieved yet. (For more details, see Treisman (1996) and the special issue of Neuron (vol. 24, 1999), entirely devoted to this problem.)

Unable to reach the required high-level processing (binding) rules, vision research took steps in a forbidden, but possibly appealing and enticing, direction: trying to derive the needed high-level knowledge from the available low-level information pieces. A range of theoretical and experimental work has been done in order to support and justify this shift in research aspirations. Two approaches can be distinguished in this regard: the chaotic attractor modeling approach (McRae, 2004; Johanson & Lansner, 2006), and the saliency attention map modeling approach (Treue, 2003; Itti, 2005). There is no need to review the details of these approaches here.
I will only note that both of them presume low-level bottom-up processing to be the most proper way of recovering high-level information. Both are computationally expensive. Both definitely violate the basic assumption about the leading role of high-level knowledge in low-level information processing.

In computer vision, the situation is even more bizarre. In fact, the computer vision community is so busy with its everyday problems that there is no time to undertake basic research ventures. Principal ideas (and their possible solutions) are usually borrowed from biological vision research. Therefore, following the trends in biological vision, computer vision R&D has for decades been deeply involved in bottom-up pixel-oriented image processing. Low-level image computations have become its prime and persistent goal, while the complicated issues of high-level processing were simply neglected and disregarded.

However, it is impossible to ignore them completely. It is generally acknowledged that any kind of image processing is unfeasible without incorporating high-level knowledge into it. For this reason, the whole history of computer-based image processing is an endless saga of attempts to seize the needed knowledge in any possible way. The oldest and the most common ploy is to capitalize on expert domain knowledge and adapt it to each and every application case. It is not surprising, therefore, that the whole realm of image processing has been (and continues to be) fragmented (segmented) according to the high-level knowledge competence of the domain experts. That is why we have today: medical imaging, aerospace imaging, infrared, biologic, underwater, geophysics, remote sensing, microscopy, radar, biomedical, X-ray, and so on “imagings”.
The advent of the Internet, with huge volumes of visual information scattered over the web, has demolished the long-lasting custom of capitalizing on expert knowledge. Image information content on the Web is unpredictable and diversified. It is useless to apply specific expert knowledge to a random set of unrelated images. To meet the challenge, the computer vision community has undertaken to develop appropriate (so-called) Content-Based Image Retrieval (CBIR) technologies (Lew et al., 2006). However, deprived of any reasonable sources of the desired high-level information, computer vision designers were forced to proceed in the only possible direction: trying to derive the high-level knowledge from the available low-level information pieces (Mojsilovic & Rogowitz, 2001; Zhang & Chen, 2003).

It would be a mistake to say that computer vision people are not aware of these discrepancies. On the contrary, they are well informed about what is going on in the field. However, they are trying to justify their attempts by promoting the concept of a “semantic gap”, an imaginary gap between low- and high-level image features. They sincerely believe that some day they will be able to bridge it (Hare et al., 2006).

It is worth mentioning that all these developments (feature binding in biological vision and semantic gap bridging in computer vision) are evolving in an atmosphere of total indifference towards preceding claims about the superiority of high-level information in the general course of visual information processing. Such indifference seems to stem from a very loose understanding of what the concept of “information” is, what the right way to use it is, and what information treatment options could arise from this understanding.

3. Trying to define “What is information?”

I was very proud of myself when it became clear to me that the trouble with image processing stems from misunderstanding and confusing the duties that machine vision and human vision systems are destined to perform: machine vision systems are for data processing, human vision systems are for information processing. It was clear to me that data and information are different things, and therefore a careless blending of them is harmful and counterproductive (as follows from the examples provided above).

However, my conjectures were not readily welcomed. My paper submitted to the BMCV 2002 Conference was rejected, and the reviewer was very strict in his comments: “The distinction between information and data processing is superficial – you have to be more specific (after all, data is information, isn’t it?)”.

I was hurt by what seemed to me to be the reviewer’s ignorance. But later I was forced to learn that this is a well-established, widespread and quite common view of the matter. Luciano Floridi’s papers (Floridi, 2003; Floridi, 2005; Floridi, 2007) are busy with refining “the Standard Definition of semantic information as meaningful data” (!!!). Alas, you cannot quarrel with Floridi, especially when your own definition is so vague and muddle-headed that it is better for you to take the stance that “information” is an indefinable entity, like “time” or “space” in classical physics. (Later I found out that a similar stance is taken by Aaron Sloman (Sloman, 2006) when he compares the indefinable notion of “information” with the indefinable notion of “energy”.)

Following my own intuition, I have finally hit on something I was so desperately looking for: a definition of information that fits my image processing requirements.
It turns out that this definition can be derived from Solomonoff’s theory of Inference (Solomonoff, 1997), Chaitin’s Algorithmic Information theory (Chaitin, 1977), and Kolmogorov’s Complexity theory (Kolmogorov, 1965). The results of my investigation have already been published on several occasions (Diamant, 2003; Diamant, 2004; Diamant, 2005; Diamant, 2007), and interested readers can easily get them from a number of freely accessible repositories (e.g., arXiv, CiteSeer (the former Research Index), Eprintweb, etc.). Therefore, I will only repeat here some important points of these early publications, which properly reflect my current understanding of the matter.

The main point is that information is a description, a certain alphabet-based or language-based description, which Kolmogorov’s theory regards as a program that, when executed, faithfully reproduces the original object (Vitanyi, 2006). In an image, such objects are the visible data structures of which the image consists. So, a set of reproducible descriptions of image data structures is the information contained in an image.

Kolmogorov’s theory prescribes the way in which such descriptions must be created: first, the most simplified and generalized structure must be described. (Recall the Occam’s Razor principle.) Then, as the level of generalization is gradually decreased, more and more fine-grained image details (structures) are revealed and depicted. This is the second important point, which follows from the theory’s purely mathematical considerations: image information is a hierarchy of descriptions with recursively decreasing levels of generalization, which unfolds in a coarse-to-fine top-down manner. (Attention, please: no bottom-up processing is mentioned here. There is no low-level feature gathering and no feature binding!!! The only proper way of eliciting image information is a top-down coarse-to-fine way of image processing.)
The third prominent point, which immediately follows from the two just mentioned, is that the top-down manner of image information elicitation does not require the incorporation of any high-level knowledge for its successful accomplishment. It is totally free from any high-level guiding rules and inspirations. That is why I call it Physical Information: information that is totally independent of any high-level interpretation of it.

What immediately follows from this is that high-level image semantics is not an integral part of image information content (as is traditionally assumed). It can no longer be seen as a natural property of an image. Image semantics, therefore, must be seen as a property of a human observer who watches and scrutinizes an image. That is why we can definitely say: semantics is assigned to an image by a human observer. That is strongly at variance with the contemporary views on the concept of semantic information. Following the new information elicitation rules, it is impossible to continue to pretend that semantics can be extracted from an image (as, for example, in (Naphade & Huang, 2002)), or should be derived from low-level information features (as in (Zhang & Chen, 2003; Mojsilovic & Rogowitz, 2001), and many other analogous publications). That simply does not hold any more.

4. Reification of the proposed idea

The new definition of information has forced us to reconsider the traditional way of doing things in image processing. The inevitable change in design philosophy, the validity of the new assumptions, and the consequences that their acceptance implies have all motivated us to test the proposed novelties in the framework of a visual robot design enterprise: an enterprise aimed at creating an artificial vision system with some human-like cognitive capabilities.
As follows from the preceding discussion, the proposed arrangement must be composed of two separate, loosely coupled parts: a Physical Information processing part and a Semantic Information processing part. The proposed block scheme of this arrangement is depicted in Fig. 1.

4.1 Physical information processing

The purpose of the Physical Information processing part is to extract the physical information buried in the image data, that is, to provide a description of the discernible image data structures present in a given image. In simple words, it provides an initial segmentation of the input image. Afterwards the segmented pieces would be submitted to a process of image analysis and interpretation (in terms of our approach, Semantic Information would be assigned to the input image).

As one can see, the proposed Physical Information processing part is composed of three sub-units: the bottom-up processing path, the top-down processing path and a stack where the discovered information content (the generated descriptions of it) is actually accumulated. (More details about Physical Information processing can be found in (Diamant, 2004; Diamant, 2005; Diamant, 2005a).)

As follows from the previously defined information processing principles (which prescribe that the most general and simplified descriptions have to be derived first), the purpose of the bottom-up processing path is to provide a simplified (compressed, squeezed) copy of the input image. The original image is squeezed along this path to a small size of approximately 100 pixels. The rules of this shrinking operation are very simple and fast: four non-overlapping neighbor pixels in an image at level L are averaged and the result is assigned to a pixel in the higher (L+1)-level image (a so-called 4 to 1 image compression). At the top of the shrinking pyramid, the image is segmented, and each segmented region is labeled.
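The bottom-up path described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation evaluated in the cited papers: the function names `shrink_4_to_1` and `build_pyramid` are hypothetical, the image is a plain list of rows of grey-level values, any odd trailing row or column is simply dropped, and the stopping rule (stop once the level holds at most `top_pixels` pixels) is one possible reading of "approximately 100 pixels".

```python
def shrink_4_to_1(image):
    """Average each non-overlapping 2x2 block into one pixel of the next level."""
    rows, cols = len(image) // 2, len(image[0]) // 2  # odd remainder is dropped (assumption)
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(image, top_pixels=100):
    """Return [level 0 (original), level 1, ...], stopping near top_pixels pixels."""
    levels = [image]
    while True:
        current = levels[-1]
        height, width = len(current), len(current[0])
        # Stop when the level is small enough or can no longer be halved.
        if height * width <= top_pixels or height < 2 or width < 2:
            break
        levels.append(shrink_4_to_1(current))
    return levels


# A synthetic 32x32 ramp image: 32x32 -> 16x16 -> 8x8 (64 pixels, below the limit).
image = [[float(r + c) for c in range(32)] for r in range(32)]
pyramid = build_pyramid(image)
sizes = [(len(level), len(level[0])) for level in pyramid]
```

Each level holds a quarter of the pixels of the level below, so the whole pyramid costs only about a third more than the original image, which is consistent with the claim that the shrinking step is cheap.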
Since the image size at the top is significantly reduced, and since severe data averaging is attained in the course of the bottom-up image squeezing, the image segmentation/classification procedure does not demand special computational effort.

From this point on, the top-down processing path commences. At each level, the segmentation maps (intensity and region labels) are expanded to the size of the image at the nearest lower level (a 1 to 4 expansion). Since the regions at different hierarchical levels do not exhibit significant changes in their characteristic intensity, the majority of newly assigned pixels are determined in a sufficiently correct manner. Only pixels at region borders and seeds of newly emerging regions may significantly deviate from the assigned values. Taking the corresponding current-level image as a reference (the left-side unsegmented image), these pixels can be easily detected and subjected to a refinement cycle. The region label map is corrected accordingly. In such a manner, the process is repeated at all descending levels until the segmentation of the original input image is accomplished.

At each processing level, every segmented image object-region (whether just recovered or inherited) is registered in the objects’ appearance list (the Stocked Level Descriptions rectangle in Fig. 1), which is the third constituent part of the proposed scheme. The registered object parameters are the available simplified object attributes, such as size, center-of-mass position, average object intensity, and hierarchical and topological relationships within and between the objects (“sub-part of…”, “at the left of…”, etc.). They are sparse, general, and yet specific enough to capture the object’s characteristic features in a variety of descriptive forms.
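One descending step of the top-down path, followed by the registration of object attributes, might look roughly as follows. The text does not spell out the refinement rule, so the one used here is an assumption made purely for illustration: a pixel whose reference intensity deviates from the mean of its inherited region by more than `tolerance` is reassigned to whichever label, among its own and those of its four neighbours, has the closest mean intensity. All names (`expand_1_to_4`, `refine`, `describe_regions`, `region_mean`) are hypothetical, and the detection of seeds of newly emerging regions is left out.

```python
def expand_1_to_4(label_map):
    """Copy every coarse label into the 2x2 block of the next lower level."""
    expanded = []
    for row in label_map:
        wide = [label for label in row for _ in range(2)]
        expanded.append(wide)
        expanded.append(list(wide))
    return expanded


def refine(labels, reference, region_mean, tolerance=10.0):
    """Correct inherited labels against the unsegmented image of the same level."""
    height, width = len(labels), len(labels[0])
    refined = [row[:] for row in labels]
    for r in range(height):
        for c in range(width):
            value = reference[r][c]
            if abs(value - region_mean[labels[r][c]]) <= tolerance:
                continue  # inherited label already fits this pixel
            # Assumed rule: choose among own label and 4-neighbour labels.
            candidates = {labels[r][c]}
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < height and 0 <= cc < width:
                    candidates.add(labels[rr][cc])
            refined[r][c] = min(sorted(candidates),
                                key=lambda lab: abs(value - region_mean[lab]))
    return refined


def describe_regions(labels, reference):
    """Register size, center of mass and average intensity for every region."""
    sums = {}
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            acc = sums.setdefault(lab, [0, 0.0, 0.0, 0.0])
            acc[0] += 1
            acc[1] += r
            acc[2] += c
            acc[3] += reference[r][c]
    return {
        lab: {"size": n, "center": (sr / n, sc / n), "mean_intensity": si / n}
        for lab, (n, sr, sc, si) in sums.items()
    }


# Coarse level: dark region 0 on the left, bright region 1 on the right.
coarse_labels = [[0, 1], [0, 1]]
region_mean = {0: 10.0, 1: 200.0}
# Finer level: the true border lies one pixel further left than the expansion suggests.
reference = [[10.0, 200.0, 200.0, 200.0] for _ in range(4)]

refined = refine(expand_1_to_4(coarse_labels), reference, region_mean)
descriptors = describe_regions(refined, reference)
```

In this toy case only the four pixels in the second column disagree with their inherited label, so only they are touched by the refinement, which mirrors the observation above that corrections are confined to region borders.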
In this way, a practical algorithm based on the principles announced above has been developed and subjected to some systematic evaluations. The results were published and can be found in (Diamant, 2004; Diamant, 2005; Diamant, 2005a). There is no need to repeat again and again that excellent, previously unattainable segmentation results have been attained in these tests, undoubtedly corroborating the new information processing principles. Not only has an unsupervised segmentation of image content been achieved (in a top-down coarse-to-fine processing manner, without any involvement of high-level knowledge), but a hierarchy of descriptions for each and every segmented lot (segmented sub-object) has been achieved as well. It contains a set of object-related parameters, which enable subsequent object reconstruction. That is exactly what we have previously defined as information. And that is the reason why we specify this information as “physical information”: it is the only information present in an image, and therefore the only information that can be extracted from an image.

4.2 Semantic information processing

Semantic information, which (as we understand now) is a property of an external observer, is completely dissociated from the physical information contained in an image. Therefore it must be treated (or modeled) in accordance with the observer-specific (his/her) cognitive information processing rules.

What are these rules? A consensus view on this topic does not exist as yet, either in biological vision theories or in computer vision practice. So, we have to blaze our own trail. We decided, thus, to meet this challenge by suggesting a new approach based on our previously declared information elicitation principles. The preliminary results of our first attempt have been published elsewhere (Diamant, 2006). As in the case of physical information, we will not repeat here all the details of this publication.
Possible implementation details of the Semantic Information processing part (solution) are depicted in Fig. 1. Here we will proceed only with a brief explanation of some of them.

Figure 1. Arrangement of Physical and Semantic Information Hierarchies and their interconnection

Human cognitive abilities (including the aptness for image interpretation and the capacity to assign semantics to an image) are empowered by the existence of a huge knowledge base about the things in the surrounding world, kept in the human brain. This knowledge base is permanently upgraded and updated during the human life span. So, if we intend to endow our visual robot with some cognitive capabilities, we have to provide it with something equivalent to this (human) knowledge base.

It goes without saying that this knowledge base will never be as large and developed as its human prototype. But we are not sure that such a requirement is valid in our case. After all, humans are also not equal in their cognitive capacities, and the content of their knowledge bases is very diversified too. (The knowledge base of an interpreter of aerial photographs is certainly different from the knowledge base of an interpreter of X-ray images, or IVUS images, or PET images.) The knowledge base of our visual robot has to be small enough to be effective and manageable, but sufficiently large to ensure the robot’s acceptable performance. Certainly, for our feasibility study we can be satisfied even with a relatively small, specific-task-oriented knowledge base.

The next crucial point is the issue of knowledge base representation. To deal with it, we first of all must arrive at a common agreement about the meaning of the term “knowledge”. (A question that usually has no commonly accepted answer.)
We state that in our case a suitable and sufficient definition of it would be: “Knowledge is memorized information”. Consequently, we can say that knowledge (like information) must be a hierarchy of descriptive items, with the grade of description detail growing in a top-down manner at the descending levels of the hierarchy.

What else must be mentioned here is that these descriptions have to be implemented in some alphabet (as in the case of physical information) or in a description language (which better fits the semantic information case). Putting any further argument aside, we will declare that the most suitable language in our case is natural human language. After all, the real knowledge bases that we are familiar with are implemented in natural human languages.

The next step, then, is predetermined: if natural language is a suitable means of description, the suitable form of this implementation is a narrative, a story (Tuffield et al., 2005). If the description hierarchy can be seen as an inverted tree, then the branches of this tree are the stories that encapsulate human experience with the surrounding world. And the leaves of these branches are the single words (single objects) of which the story parts (single scenes) are composed.

The descent into description details, however, does not stop here, and each single word (single object) can be further decomposed into its attributes and the rules that describe the relations between the attributes.

At this stage the physical information reappears. Because words are usually associated with physical objects in the real world, words’ attributes must be seen as memorized physical information (descriptions). Once derived (by the HVS) from the observable world and learned to be associated with a particular word, these physical information descriptions are soldered into the knowledge base.
Object recognition thus turns out to be a comparison and similarity test between the currently acquired physical information and that already retained in memory. If the similarity test is successful, then starting from this point in the hierarchy and climbing back up the knowledge base ladder we obtain: first, the linguistic label for the recognized object; second, the position of this label (word) in the context of the whole story; and third, the ability to verify the validity of the initial guess by testing the appropriateness of the neighboring parts composing the object or the context of the story. In this way, a meaningful categorization of the object can be reached, and the first stage of image annotation can be successfully accomplished, providing the basis for further meaningful (semantic) image interpretation.

One question has remained untouched in our discourse: how is this artificial knowledge base to be initially created and placed at the robot’s disposal? The vigilant reader certainly remembers the fierce debates about the learning capabilities of neural networks and other machine learning technologies. We are aware of these debates. But in our case we can state with certainty: they are irrelevant. The reason is simple: the top-down fashion of knowledge base development predetermines that all responsibility for knowledge base creation has to be placed on the shoulders of the robot designer.

Such an unexpected twist in design philosophy will be less surprising if we recall that human cognitive memory is also often defined as “declarative memory”. And the prime mode of human learning is the declarative learning mode, in which new knowledge is explicitly transferred to a developing human from his external surroundings: from a father to a child, from a teacher to a student, from an instructor to a trainee.
So, our proposal that the robot’s knowledge base has to be designed and created by the robot supervisor is sufficiently sound and fits our general concept of information use and management.

5. More explanation is required

The proposed Semantic Information Processing scheme must seem so annoyingly different from other forms of knowledge management that further explanation in its defence must be provided.

The vigilant reader has certainly noticed that the term “ontology” does not appear in the text, although ontology is a ubiquitously used technique for human knowledge base creation and representation. I have chosen to avoid the term for the following reason. More than twenty years ago, a famous Soviet mathematician, Israel Gelfand, and his colleagues were trying to devise a knowledge-based system for medical diagnostic problem solving. From the very beginning, the need for an adequate description language became apparent, and extensive research efforts were spent moving toward this goal. The notion of ontology was not yet known; the seminal paper of Thomas Gruber would appear only in 1993 (Gruber, 1993). However, in the preface to the book that summarizes their experience, which was well ahead of its time, Israel Gelfand, referring to the difficulties of language creation, writes: “There are two ways to create a language: to compose literature scripts or to compile a dictionary. We all know how significant for the Russian language were the works of Pushkin and Dhale” (Gelfand et al., 1989). (We would add accordingly: Shakespeare and Dr. Johnson, for the English language.)

The problem is that in contemporary knowledge-based systems design, ontology is used in only one of its manifestations: as a vocabulary, a thesaurus. That is certainly an omission and a fault, which in our design we are trying to avoid. Imagine a Martian guest trying to understand our world relying only on the Concise Oxford Dictionary.
On the other hand, you can easily recall the picture books with stories that your grandmother read to you again and again in childhood.

The storytelling approach that we decided to pursue (and are trying to implement) is also very different from those found in today’s research papers. The current trend in storytelling research and development is focused on automatic narrative creation, very similar to what is going on in classical ontology design practice. In this regard, it is appropriate to recall that we reject the tradition of autonomous ontology creation. We are inclined to the “grandmother approach”, where, as already explained, new knowledge comes to its possessor from the outside, from someone who already possesses it: a grandmother telling the child her stories, dancing bees that convey to the rest of the hive information about melliferous sites (Zhang et al., 2005), ants that learn in tandem (Franks & Richardson, 2006), and even bacteria developing antibiotic resistance as a result of so-called horizontal gene transfer, when a single DNA fragment of one bacterium is disseminated among other colony members (Lawrence & Hendrickson, 2003).

That is, in our case this is a job for the robot’s designer. In a storytelling manner, he has to transfer to the robot his view of the surrounding world and his understanding of proper behavior in different task-inspired situations. I am aware that by denying bottom-up, machine-learning-inspired knowledge acquisition I am waking all the bears in my environment. But, sorry, this is only an attempt to find a way out of the dead-end alley where image processing has been stalled for so many years.

Let us continue. Vigilant readers have certainly also noticed that the name of Claude Shannon (the famous inventor of the Information Theory of Communication) is not mentioned in the paper.
The reason for this is clear and plain: Shannon says nothing about the notion of information, about “What is information?” He invented a measure of information, but that says nothing about the notion of information itself. Likewise, the measures of time that we ubiquitously use (second, hour, day, etc.) tell nothing about the notion of time, about “What is time?”

Kolmogorov, too, was busy with very different things. Randomness was his main concern. According to Kolmogorov’s theory, a message composed as a sequence of random values cannot be reproduced by a description program that is shorter than the original message. That is, the description of a random message is the message itself. What follows from this is that nonrandom data structures can be described in a concise, compressed form, which Chaitin calls “Algorithmic Information” (Chaitin, 1977), Floridi calls “Meaningful data” (Floridi, 2005), and Vitanyi calls “Meaningful Information” (Vitanyi, 2006). That means that each message can be seen as a composition of a compressible, information-bearing part and a non-compressible, information-devoid, random data part. The first part we call Physical Information, and it is obvious that processing only this part of the message will give us a tremendous gain over the data processing case, where meaningful and meaningless data are inseparable.

The March 2008 issue of the IEEE Signal Processing Magazine is entirely devoted to this problem: in different domains of signal processing, people have empirically discovered the advantages of what they call “Compressive Sampling”. In the preface to the magazine the guest editors write: “At the heart of the new approach are two crucial observations. The first is that the Shannon/Nyquist signal representation exploits only minimal prior knowledge about the signal being sampled, namely its bandwidth.
However, most objects we are interested in acquiring are structured and depend upon a smaller number of degrees of freedom than the bandwidth suggests. In other words, most objects of interest are sparse or compressible in the sense that they can be encoded with just a few numbers without numerical or perceptual loss”. Bravo! There could be no better explanation of the benefits of information processing versus brute-force data processing. Tradition is, however, stronger than reason: the rest of the magazine is devoted to the alchemy of accomplishing compressive sampling via bottom-up raw data processing.

I would like to spend some words on the latest developments in HVS research. While the mainstream of human vision research continues to approach visual information processing in a bottom-up, feed-forward fashion (Serre et al., 2005; Kveraga et al., 2007), it turns out that the idea of primary top-down processing was never extraneous to biological vision. The first publications addressing this issue date back to the late seventies and early eighties of the last century (Navon, 1977; Chen, 1982). Their authors were persistent in their claims, and further research reports have been published regularly until recently (Navon, 2003; Chen, 2005). However, it looks like they have been overlooked, both in biological and in computer vision research. Only in the last few years has a tide of new evidence become visible, and it is now pervasively discussed. Although the spirit of these discussions is still different from our view on the subject, the trend is certainly in favor of primarily top-down visual information processing (Ahissar & Hochstein, 2004; Juan et al., 2004). Again, only top-down processing within the physical information processing part is assumed in these discussions.
The partition of information processing proposed in this paper is not acknowledged by contemporary vision researchers.

6. Some conclusions

In this paper, I have proposed a few ideas that are entirely new and therefore might look suspicious. All the novelties come as a natural extension of a new definition of information that is successively applied to various aspects of image processing. The most important innovation is positing information image processing as the prime mode of image processing (in contrast to the traditionally dominant data image processing). The next novelty is the dissociation between physical and semantic information processing within the information-processing domain. The proposed arrangement of information-processing hierarchies is a further extension of the basic idea of the information-processing nature of the HVS, and of its imitation in an artificial vision system, our hypothetical visual robot design.

Despite the skeptical welcome, the efficiency of unsupervised, top-down directed, region-based image segmentation is hard to disprove today. Although the storytelling approach to knowledge base hierarchy creation is not yet so rigorously proved, we hope that this development stage will also be successfully surmounted. I hope that the time of our persuasive success is not far away.

7. References

Ahissar, M. & Hochstein, S. (2004). The reverse hierarchy theory of visual perceptual learning, Trends in Cognitive Science, vol. 8, no. 10, pp. 457-464, 2004.
Barsalou, L.W. (1999). Perceptual symbol systems, Behavioral and Brain Sciences, vol. 22, pp. 577-660, 1999.
Biederman, I. (1987). Recognition-by-Components: A Theory of Human Image Understanding, Psychological Review, vol. 94, no. 2, pp. 115-147, 1987.
Biederman, I. (2006). Perceptual Pleasure and the Brain, American Scientist, vol. 94, pp. 249-255, May-June 2006.
BittWare. (2007). Available: http://www.sarsen.net/sarsen-manufacture-bitware-standard-amc-b2.html.
Chaitin, G. J. (1977). Algorithmic Information Theory, IBM Journal of Research and Development, vol. 21, pp. 350-359, 1977.
Chen, L. (1982). Topological structure in visual perception, Science, 218, pp. 699-700, 1982.
Chen, L. (2005). The topological approach to perceptual organization, Visual Cognition, vol. 12, no. 4, pp. 553-637, 2005.
Diamant, E. (2004). Top-Down Unsupervised Image Segmentation (it sounds like an oxymoron, but actually it isn’t), Proceedings of the 3rd Pattern Recognition in Remote Sensing Workshop (PRRS’04), Kingston University, UK, August 2004.
Diamant, E. (2005). Searching for image information content, its discovery, extraction, and representation, Journal of Electronic Imaging, vol. 14, issue 1, January-March 2005.
Diamant, E. (2005a). Does a plane imitate a bird? Does computer vision have to follow biological paradigms?, In: De Gregorio, M., et al, (Eds.), Brain, Vision, and Artificial Intelligence, First International Symposium Proceedings. LNCS, vol. 3704, Springer-Verlag, pp. 108-115, 2005. Available: http://www.vidiamant.info.
Diamant, E. (2006). In Quest of Image Semantics: Are We Looking for It Under the Right Lamppost?, http://arxiv.org/abs/cs.CV/0609003.
Diamant, E. (2007). Modeling human-like intelligent image processing: An information processing perspective and approach, Signal Processing: Image Communication, vol. 22, pp. 583-590, 2007.
European IST Research (2005-2006): Building on Assets, Seizing Opportunities. Available: http://europa.eu.int/information_society/.
Floridi, L. (2003). From Data to Semantic Information, Entropy, vol. 5, pp. 125-145, 2003.
Floridi, L. (2005). Is Semantic Information Meaningful Data? Philosophy and Phenomenological Research, vol. LXX, no. 2, pp. 351-370, March 2005.
Floridi, L. (2007). In defence of the veridical nature of semantic information, European Journal of Analytic Philosophy, vol. 3, no. 1, pp. 31-41, 2007.
Floridi, L. (2007). Trends in the Philosophy of Information, In: P. Adriaans, J. van Benthem (Eds.), “Handbook of Philosophy of Information”, Elsevier, (forthcoming). Available: http://www.philosophyofinformation.net.
Franks, N. & Richardson, T. (2006). Teaching in tandem-running ants, Nature, 439, p. 153, January 12, 2006.
Gelfand, I.M.; Rosenfeld, B.I.; Shifrin, M.A. (1989). Essays on Collaboration of Mathematicians and Physicians, Nauka Publisher, 1989.
Gruber, T.R. (1993). Toward Principles for the Design of Ontologies Used for Knowledge Sharing, In: Formal Ontology in Conceptual Analysis and Knowledge Representation, Kluwer Publisher, 1993. Available: http://kls-web.stanford.edu/authorindex/Gruber.
Hare, J., Lewis, P., Enser, P., and Sandom, C. (2006). Mind the Gap: Another look at the problem of the semantic gap in image retrieval, Proceedings of Multimedia Content Analysis, Management and Retrieval Conference, SPIE vol. 6073, 2006. Available: http://www.ecs.soton.ac.uk/people/.
Itti, L. (2005). Models of Bottom-Up Attention and Saliency, In: Neurobiology of Attention, (L. Itti, G. Rees, J. Tsotsos, Eds.), pp. 576-582, San Diego, CA: Elsevier, 2005.
Johansson, C. & Lansner, A. (2006). Attractor Memory with Self-organizing Input, Workshop on Biologically Inspired Approaches to Advanced Information Technology (BioADIT 2005), LNCS, vol. 3853, pp. 265-280, Springer-Verlag, 2006.
Juan, C-H.; Campana, G. & Walsh, V. (2004). Cortical interactions in vision and awareness: hierarchies in reverse, Progress in Brain Research, vol. 144, pp. 117-130, 2004.
Kolmogorov, A. (1965). Three approaches to the quantitative definition of information, Problems of Information and Transmission, vol. 1, No. 1, pp. 1-7, 1965.
Kveraga, K.; Ghuman, A. & Bar, M. (2007). Top-down predictions in the cognitive brain, Brain and Cognition, vol. 65, pp. 145-168, 2007.
Lawrence, J. & Hendrickson, H. (2003). Lateral gene transfer: when will adolescence end?, Molecular Microbiology, vol. 50, no. 3, pp. 739-749, 2003.
Lew, M.S., Sebe, N., Djeraba, C. and Jain, R. (2006). Content-based Multimedia Information Retrieval: State of the Art and Challenges, In: ACM Transactions on Multimedia Computing, Communications, and Applications, February 2006.
Marques, O. & Furht, B. (2002). Content-Based Visual Information Retrieval, In: (T.K. Shih, Ed.), Distributed Multimedia Databases: Techniques and Applications, Idea Group Publishing, Hershey, Pennsylvania, 2002.
Marr, D. (1978). Representing visual information: A computational approach, Lectures on Mathematics in the Life Science, vol. 10, pp. 61-80, 1978.
Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, Freeman, San Francisco, 1982.
McRae, K. (2004). Semantic Memory: Some insights from Feature-based Connectionist Attractor Networks, Ed. B. H. Ross, The Psychology of Learning and Motivation, vol. 45, 2004. Available: http://amdrae.ssc.uwo.ca/.
Milner, D. & Goodale, M. (1998). The Visual Brain in Action, Oxford Psychology Series, No. 27, Oxford University Press, 1998.
Mobile video. (2008). Available: http://www.dspdesignline.com/howto/207100795.
Mojsilovic, A. & Rogowitz, B. (2001). Capturing image semantics with low-level descriptors, In: Proceedings of the International Conference on Image Processing (ICIP-01), pp. 18-21, Thessaloniki, Greece, October 2001.
Naphade, M. & Huang, T.S. (2002). Extracting Semantics From Audiovisual Content: The Final Frontier in Multimedia Retrieval, IEEE Transactions on Neural Networks, vol. 13, No. 4, pp. 793-810, July 2002.
Navon, D. (1977). Forest Before Trees: The Precedence of Global Features in Visual Perception, Cognitive Psychology, 9, pp. 353-383, 1977.
Navon, D. (2003). What does a compound letter tell the psychologist’s mind?, Acta Psychologica, vol. 114, pp. 273-309, 2003.
Nokia. (2008). Available: http://en.wikipedia.org/wiki/Nokia.
Palmeri, T. & Gauthier, I. (2004). Visual Object Understanding, Nature Reviews: Neuroscience, vol. 5, pp. 291-304, April 2004.
Puri, A. & Eleftheriadis, A. (1998). MPEG-4: An object-based multimedia coding standard, Mobile Networks and Applications, vol. 3, issue 1, pp. 5-32, 1998.
Serre, T.; Kouh, M.; Cadieu, C.; Knoblich, U.; Kreiman, G. & Poggio, T. (2005). A theory of object recognition: computations and circuits in the feedforward path of the ventral stream in primate visual cortex, CBCL MIT paper, November 2005. (Available: http://web.mit.edu/serre/...)
Sloman, A. (2006). What is information? Meaning? Semantic content?, Available: http://www.cs.bham.ac.uk/research/projects/cosy/papers/.
Solomonoff, R. J. (1997). The Discovery of Algorithmic Probability, Journal of Computer and System Science, vol. 55, No. 1, pp. 73-88, 1997.
Sullivan, G. & Wiegand, T. (2005). Video Compression – From Concepts to the H.264/AVC Standard, Proceedings of the IEEE, vol. 93, no. 1, pp. 18-xx, January 2005.
Thevenin, M., Paindavoine, M., Letellier, L., Heyrman, B. (2008). Embedded processor extensions for image processing, Proceedings of SPIE, vol. 7001, April 2008.
Treisman, A. & Gelade, G. (1980). A feature-integration theory of attention, Cognitive Psychology, vol. 12, pp. 97-136, Jan. 1980.
Treisman, A. (1996). The binding problem. Current Opinion in Neurobiology, vol. 6, pp. 171-178, 1996.
Treue, S. (2003). Visual attention: the where, what, how and why of saliency, Current Opinion in Neurobiology, vol. 13, pp. 428-432, 2003.
Tuffield, M.; Shadbolt, N. & Millard, D. (2005). Narratives as a Form of Knowledge Transfer: Narrative Theory and Semantics, Proceedings of the 1st AKT (Advance Knowledge Technologies) Symposium, Milton Keynes, UK, June 2005.
Video Surveillance. (2007). Networking/IP to drive video surveillance market growth. Available: http://semiconductors.tekrati.com/research/8608/.
Vitanyi, P. (2006). Meaningful Information, IEEE Transactions on Information Theory, vol. 52, No. 10, pp. 4617-4624, October 2006. Available: http://www.cwi.nl/~paulv/papers.
Zhang, C. & Chen, T. (2003). From Low Level Features to High Level Semantics, In: Handbook of Video Databases: Design and Applications, by Furht, Borko/ Marques, Oge, Publisher: CRC Press, October 2003.
Zhang, S.; Bock, F.; Si, A.; Tautz, J. & Srinivasan, M. (2005). Visual working memory in decision making by honey bees, Proceedings of The National Academy of Science of the USA (PNAS), vol. 102, no. 14, pp. 5250-5255, April 5, 2005.
Zhou, X.S. & Huang, T.S. (2000). CBIR: From low-Level Features to High-Level Semantics, Proceedings SPIE, vol. 3974, pp. 426-431, San Jose, CA, January 24-28, 2000. Available: http://www.ifp.uiuc.edu/~xzhou2/.