Difficulty in Grouping Process Dissociation
Perceptual grouping is the process by which raw image elements are aggregated into larger and more meaningful collections. Grouping is widely assumed to be early, automatic, and preattentive, though the extent to which grouping can proceed without attention is controversial. Grouping and scene organization can impose decisive influences on other low-level processes. Grouping is a necessary precursor to object recognition, because for complexity reasons only well-organized groups, rather than arbitrary subsets of the image, can be compared against stored object models. Nevertheless, grouping is certainly one of the least understood problems in vision. This state of affairs reflects the difficulty of precisely formalizing subtle human intuitions about the relative “reasonableness” of candidate groups.
Indeed, notwithstanding the rapidity and effortlessness with which human perceivers perform it, grouping is an extremely difficult problem from a computational point of view. The number of candidate groups in a configuration of n items is equal to the number of subsets and hence is exponential (2^n); the number of partitions (divisions of the n items into disjoint subsets) grows faster still with n. Many early grouping phenomena, such as the detection of collinearity, are often treated by researchers as local problems in a restricted neighborhood, thus reducing the amount of computation required. However, the more general problem of grouping is well known to involve global effects. Long-distance influences over large areas of the image are common, meaning the fundamental complexity remains extremely high (a fact reflected in the very term “Gestalt”, connoting the primacy of the whole). Perhaps the best illustration of the difficulty is the fact that in computational vision, it has become commonplace to require a human user to outline target shapes in images before recognition or motion tracking can commence, because existing grouping algorithms do not provide sufficiently robust or accurate results. The lack of good algorithms in turn reflects the failure of psychologists to propose a theory rigorous and concrete enough to be implemented computationally.
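To make the combinatorial growth concrete, the following minimal sketch counts candidate groups (subsets) and partitions for a few values of n. It is an illustration only and not part of the theory described here; the function names `subset_count` and `partition_count` are hypothetical, and the partition count is taken to be the Bell number, computed with the standard Bell-triangle recurrence.

```python
def subset_count(n: int) -> int:
    """Number of candidate groups (subsets) of n items: 2^n."""
    return 2 ** n


def partition_count(n: int) -> int:
    """Number of partitions of n items into disjoint subsets (the Bell number).

    Uses the Bell triangle: each row starts with the last entry of the
    previous row, and each later entry is the sum of its left neighbor
    and the entry above that neighbor.
    """
    row = [1]  # row 0 of the Bell triangle
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


if __name__ == "__main__":
    # Illustrative sizes only; real images contain far more elements.
    for n in (5, 10, 20):
        print(n, subset_count(n), partition_count(n))
```

Even at n = 10 there are 1,024 subsets but 115,975 partitions, which suggests why exhaustive search over groupings is not a plausible strategy.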
Yet the real theoretical difficulty in grouping stems from the difficulty in clearly defining the computational goal: a rigorous definition of what makes a “good group”. Unlike such physically grounded variables as depth, color, and motion, goodness of grouping candidates does not have an objective physical definition. Some ways of combining image elements simply seem more intuitively reasonable than others. The Gestaltists called this elusive quality of perceptual goodness Prägnanz, usually translated as “good form”.
Two general strategies for attacking this problem in the literature can be distinguished. Loosely, some authors seek to explain the procedure by which the visual system arrives at its preferred percept – i.e., find a process model – while others attempt to characterize the nature of the preferred percept itself (cf. the distinction between dynamic and static approaches noted by Van der Helm & Leeuwenberg (1996)). The distinction is related to Marr’s well-known division between an algorithmic theory and a theory of the computation, the latter sometimes referred to as a competence theory following Chomsky’s terminology. As such the two approaches operate at distinct but mutually compatible levels of analysis. The research described in the current paper places the emphasis on the competence theory, on the belief that trying to discover how the visual system computes something – without first defining that thing – amounts to letting the tail wag the dog.
Hence, this paper focuses on an attempt to define in formal terms exactly which interpretation for a given scene is most preferred by human observers, and why. Mathematical details and computational issues in the theory, called minimal model theory, are explained in more detail elsewhere. The emphasis here will be on one particular issue: the role of grouping units. What kind of groups – contours, surfaces, objects etc. – are image items aggregated into, and why? In particular I will attempt to shed light on the somewhat amorphous concept of “object”, the grouping unit most difficult to define and hence, perhaps, most in need of a rigorous theory.
In the common wisdom, perceptual grouping is the process whereby the visual image is decomposed into objects. However, this definition is somewhat at odds with the way perceptual grouping is studied in practice by researchers in the field. More commonly, research has centered around how visual items are organized into striated patterns, contours, and Moiré patterns. Researchers studying perceptual completion behind a subjective occluder or a visible occluder have usually conceptualized the completed thing as a simple surface. Such an object though is at most a very simple one, consisting of only a single closed region, and almost invariably 2D. The computational literature has also focused primarily on contours and surfaces. In the human vision literature in general, there is a widespread view promulgated by Gibson (1979) that surfaces rather than objects are the primary unit of visual representation.
Objects per se have been little studied in the context of grouping. For the most part this probably stems from the difficulty in precisely defining them. Contours always have a certain well-defined geometrical form: they are 1D space curves, i.e., smooth deformations of the unit line. Similarly, surfaces are always smooth deformations of a neighborhood of the plane. Many objects are simply 3D analogs of contours and surfaces: smoothly bounded regions of 3D space (i.e., “blobs”). In general though objects can be more complex than this, having parts and articulated substructures, and potentially complex spatial relations within them. Given the difficulty in completely characterizing human grouping preferences even for these geometrically simpler units, grouping researchers have not often approached the more abstract problem of objects directly.
Reiter and Mackworth (1989) (see also Clowes, 1971) have proposed a definition of an interpretation using ideas from mathematical logic, a field in which the idea of enumerating the alternative interpretations of a fixed set of facts is a central concept. Their definition is quite technical. The discussion here follows their definition only loosely, and is oriented specifically around the idea of choosing grouping units.
- June 27th