
Multimodal generative AI and accessibility
AI Applications to Support Communication
by Gino Roncaglia
In very general terms, accessibility refers to the capacity of an environment, a device, a service, a piece of content, or a resource to be used in the easiest, most effective, and most complete way by the widest possible range of users, regardless of their individual and collective characteristics and abilities, including ethnic background, socioeconomic conditions, cognitive styles, any disabilities or different abilities, and the type of access tools used [1].
Accessibility in the IT field (including accessibility of network content, resources, and services) thus represents only a subset of a broader field, and it should be noted that IT tools can also be used to guarantee access to physical environments, services, or resources: for example, an app designed for blind users that uses a smartphone camera to collect visual information about the surrounding environment and convert it into sound information can be used in a variety of everyday life situations, including interaction with physical environments (crossing a street, visiting a museum…).
A particularly significant component (although certainly not the only one) of the work done in the field of accessibility concerns precisely the compensation for difficulties in perceiving or interpreting a communicative code through the use of information based on other communicative codes. Thus, to name just a few of the many possible examples, an audio description can assist blind people in the enjoyment of visual content, a text transcription can help deaf people in the enjoyment of audio content, a translation, a simpler reformulation, or a visual description can help people with limited skills in the source language to understand complex textual information, and so on. The availability of tools that can transfer informational content from one expressive mode to another, and do so automatically, quickly, and with good qualitative results, therefore represents a revolutionary innovation.
It should be added that accessibility often has a contextual nature, which traditional compensatory tools do not always manage adequately. Consider, for example, the textual description of a museum artifact. This description has a primary task of a fairly general level: to present the artifact and make it 'interpretable' by a non-specialist visitor. In some cases (but rarely in all) a museum may also offer one or more additional descriptions: for example, a Braille description intended for blind people, and/or an 'easy' description for children, and/or a more detailed description, for instance through a guide or an audio guide. Each of these descriptive levels has its own characteristics and addresses – to borrow a term widely studied in the semiotic field – its own 'ideal reader'. But actual visitors present a much broader spectrum of situations and needs. There may be visitors of various ages, from different backgrounds and with different interests, with specific curiosities, with different levels of prior knowledge and skills. Clearly, no museum in the world could personalize its informational tools to such a level of differentiation and detail as to meet these multiple situations and needs. An artificial intelligence system, on the other hand, could turn into a sort of customizable 'guide', capable of interacting with the visitor to determine, each time, the most appropriate communicative code, type, and level of description, and then generate its output based on the specific context.
To get an idea of the possibilities in this field, I provided ChatGPT 4V (the V stands for 'vision' and indicates the visual component of the model) with a file containing an image of the painting The Starry Night by Vincent van Gogh, without specifying the nature of the image in the prompt, asking only for a description. ChatGPT immediately recognized the painting and returned the following description:
The image you uploaded is a reproduction of the famous painting The Starry Night by Vincent van Gogh. This masterpiece was created in 1889 and represents a view from the window of the Saint-Rémy-de-Provence psychiatric hospital where van Gogh was hospitalized. The painting is known for its turbulent and dream-like sky, rich with swirling shining stars and a bright crescent moon. The brushstrokes are thick and dynamic, creating a movement that seems to animate the night sky. The sky is dominated by shades of blue and yellow, with touches of white for the stars and the moon.
In the lower part of the painting, the landscape is calmer and more orderly. There are rolling hills, a church with a tall and slender bell tower silhouetted against the sky, and low houses with windows that look like small pictures of light in the night. A tall and sinuous cypress, typical of the southern French landscape, rises in the foreground, serving as a bridge between the earth and the sky.
The earth tones are more muted compared to the sky, with various shades of green, blue, and brown giving a sense of quiet contrasting with the turmoil above. The composition of the painting guides the eye to move through the scene, from the cypress to the dynamic sky and then towards the sleeping village. [2]
When the system is asked to provide a description of the painting suitable for children aged eight or nine, the result is instead the following:
This painting looks like a party in the sky and was painted a long time ago by a painter named Vincent van Gogh, who loved to use many colors and make all the brushstrokes curved and wavy.
The painting is called The Starry Night, and if you look at it, you can see lots of little stars spinning and chasing each other in the sky. The sky looks painted with waves of blue and yellow, and if you focus, you can almost hear the music of the stars.
Under this sky that seems to move, there is a little village sleeping under a big dark blanket. The little houses have their lights off because it is night and everyone is dreaming. There is also a little church with a bell tower that seems to touch the sky, almost like it wants to reach the moon.
And then there is a huge tree that looks like a giant green feather, and it seems like it is tickling the stars. Van Gogh painted this tree to connect the earth with the sky, helping us understand that everything is connected.
When you look at this painting, you can imagine being part of this magical world that van Gogh created. And even though we are in a museum, if you close your eyes, you can dream of being there, in the middle of that party in the sky. [3]
Here is the response to the request to describe the painting from a more strictly pictorial point of view:
From a pictorial point of view, Vincent van Gogh’s The Starry Night is a masterpiece of expressionist technique, where the use of color and brushstrokes is as emotional as it is innovative. Van Gogh uses the impasto technique, applying color in dense and thick layers, which adds rich three-dimensionality to the canvas.
The brushstrokes are vigorous and visible, creating a pulsating rhythm across the entire surface. The artist makes circular movements to represent the sky, creating vortices that capture the eye and guide it through the composition. These brushstrokes not only define the shape and movement of the sky but also the brightness of the stars and the moon.
Van Gogh chooses a color palette that is both realistic and fantastic. He uses shades of blue and cyan for the night sky, interspersed with touches of yellow, white, and orange to represent the brightness of the stars and the moon. The contrast between the warm yellow of the stars and the cool blue of the sky creates a vibrant light effect that seems to emanate from within the painting.
The landscape below is more earthly and reassuring, with blues and greens creating a sense of calm stability, in sharp contrast to the turmoil of the sky. The cypress, which stands out in the foreground, is rendered with vertical and sinuous brushstrokes, emphasizing its nature as a link between earth and sky. In this painting, van Gogh does not care about photographic precision but rather expresses his emotional experience of the landscape, using painting as a means to convey his turbulent inner struggle. The pictorial technique becomes an extension of his psyche, making The Starry Night a deeply personal and universal work at the same time. [4]
The examples given are obviously only samples, and – through the mediation of the prompt or of an appropriate context (where 'context' means a set of information provided to the system after its training, aimed at steering the output toward the specific objectives to be achieved) – the system can further refine and personalize the description according to need.
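To make this kind of workflow concrete, here is a minimal sketch of how audience-targeted descriptions of the same image could be requested programmatically. It assumes the OpenAI Python SDK and a vision-capable model; the model name and file name are illustrative placeholders (not the setup actually used for the examples above), while the three prompts are those reported in notes [2]-[4].

```python
# Sketch: requesting audience-targeted descriptions of one image.
# Assumes the OpenAI Python SDK and a vision-capable model; model name
# and file name are placeholders, not the article's actual setup.
import base64
from openai import OpenAI

client = OpenAI()

def describe(image_path: str, prompt: str) -> str:
    """Send one image plus one audience-specific prompt to the model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# The three prompts reported in notes [2]-[4]. Unlike the conversational
# session described there, each call here is independent.
for prompt in [
    "Provide me with a detailed description of this image.",
    "Now provide me with a description of the image suitable for 8-9-year-old children.",
    "Provide me with a description of the image from the point of view of the painting technique used.",
]:
    print(describe("starry_night.jpg", prompt), "\n")
```

In practice, the 'context' mentioned above would be supplied as an additional system message, so that the visitor's situation and needs steer every description the model produces.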
Van Gogh’s painting is well known, probably part of the training corpus on which the ChatGPT 4V VLM was built, and the system knows how to identify it; but, let us remember, the training corpus is not used as a database: the outputs produced by ChatGPT are not 'copied' from the information it was trained on but generated anew, one token at a time, on the basis of statistical probabilities. We can verify this by noting that the results are qualitatively excellent – and in some ways even more impressive – if we use lesser-known or completely unknown images. Fabio Ciotti discussed this topic in a very interesting Facebook thread, showing how the system describes – correctly capturing even emotional nuances – a painting created by his daughter and certainly not present on the web. [5]
The ability to generate detailed and customizable multimodal descriptions represents a true revolution in the field of accessibility: some cases are already quite well known [6], and many others are being explored in these very months. Among the effects of this revolution will probably also be the rethinking of some choices that seemed consolidated. For example, in the field of web accessibility and digital publishing, a key element is represented by the alternative textual descriptions provided through the 'ALT' attribute of the 'IMG' tag. This attribute, however, serves (at least) two different functions: on the one hand, alternative textual descriptions assist in the interpretation of images by people with visual disabilities; on the other hand, they are also a useful tool for search engines, both to improve text indexing and to allow more effective searching and retrieval of the images that appear on a page. In other words, the text provided through the 'ALT' attribute is both an accessibility tool and a tool for search engine optimization (SEO). It is easy to see that these two different functions could suggest the use of at least partly different descriptive texts. At the same time, the use of alternative textual descriptions is far less widespread than would be desirable: in May 2022, on the occasion of Global Accessibility Awareness Day, Twitter (now X) released a striking statistic on this matter: only 0.6% of the images used in tweets are accompanied by 'alt text' [7]. This means that over 99% of tweets with images are effectively inaccessible to a blind person. The generation of 'alt text' using generative AI systems, possibly also to supplement content already produced, can help reverse this situation, and it also makes it possible to consider the use of multiple textual descriptions for different purposes. Work in this direction has just begun, and the first image-to-speech systems had difficulty enabling generative AI to grasp the context of the images to be described: the description 'man in white coat holds an orange cat still on a white table' is more detailed but probably less informative than the simpler 'veterinarian examining a cat' [8]; natively multimodal systems should allow decidedly better results in this field.
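As an illustration of this direction, the following is a minimal, hedged sketch of context-aware 'alt text' generation: the surrounding page text is passed along with the image, so that the model can prefer a description like 'veterinarian examining a cat' over an exhaustive inventory of visual details. It again assumes the OpenAI Python SDK; the prompt, model name, and file names are illustrative, not a fixed recipe.

```python
# Sketch: context-aware alt-text generation for images that lack one.
# Assumes the OpenAI Python SDK and a vision-capable model; the prompt,
# model name, and file names are illustrative assumptions.
import base64
import html
from openai import OpenAI

client = OpenAI()

def generate_alt_text(image_path: str, page_context: str) -> str:
    """Ask the model for a short alt text that reflects the page context,
    favouring the image's role on the page over exhaustive visual detail."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Write a concise alt text (under 15 words) for this image. "
                    "Use the surrounding page text to convey the image's role "
                    "in context rather than listing every visual detail. "
                    f"Page text: {page_context}"
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

# Example: fill the ALT attribute of an IMG tag that shipped without one.
alt = generate_alt_text("clinic_photo.jpg",
                        "Annual check-ups at our veterinary clinic")
print(f'<img src="clinic_photo.jpg" alt="{html.escape(alt)}">')
```

A second call with an SEO-oriented prompt could produce the alternative, keyword-focused description mentioned above, since the two functions of the 'ALT' attribute may well call for different texts.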
This contribution is excerpted from the article: Multimodal Generative Artificial Intelligence in the Humanities. Some Introductory Considerations, 2024, DIGITCULT, 8 (2), 127-137.
[1] Here I have tried to summarize in a concise form the main common features of some of the many and diverse definitions of the concept of accessibility proposed in the literature; for a fairly recent overview that includes some of the most important reference works on the subject, especially in the field of human-computer interaction, see Mack et al., 2021.
[2] The response was given by ChatGPT 4V on December 28, 2023. Prompt: “Provide me with a detailed description of this image.”
[3] The response was given by ChatGPT 4 on December 28, 2023. Prompt: “Now provide me with a description of the image suitable for 8-9-year-old children.”
[4] The response was given by ChatGPT 4 on December 28, 2023. Prompt: “Provide me with a description of the image from the point of view of the painting technique used.”
[5] Fabio Ciotti, Facebook post of October 14, 2023, accessible here. For the description of the Klee painting to which the post refers, see here.
[6] To give just one example, the Be My Eyes app, which uses the smartphone camera to provide a blind user with a description of the surrounding environment, is moving from the use of human volunteers (since 2015, more than 6 million volunteers have collaborated in providing voice descriptions of what app users' smartphones filmed) to the use of ChatGPT 4, with significant advantages both in the speed of description and with regard to the privacy of the blind person: see here.
[7] https://twitter.com/XA11y/status/1527287962479869952.
[8] The example is in the post Be Careful When Using A.I. for Alternative Text on the Bureau of Internet Accessibility blog, October 26, 2023.