Organizers’ contribution: What distinguishes Humanities’ texts from other data?

Manfred Thaller, University of Cologne

(i) The Humanities are a very broad field. The following ideas relate to those Humanities disciplines which deal with “historical texts”, or at least started from them. “Historical” in this context denotes any text created by actors whom we can no longer consult. This creates a complication when we understand an existing text as a message from a sender to a recipient. That understanding is absolutely fundamental to modern information technology, as it is the model used in Shannon’s article of 1948, one of the cornerstones of modern information theory and, for most computer scientists, the cornerstone upon which Computer Science has been built. All of the measures Shannon proposes require knowing what the message contained before the sender transmitted it. Shannon himself acknowledges another important restriction:

The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem.

(Shannon, 1948, 379)

The fact that information processing systems start from a model which ignores semantics from page one is ultimately the reason why meaning has to be added to the signal stream in ways which allow that information to be transmitted (or processed) as an integral part of the signal stream, today usually as embedded markup. The markup is embedded into a signal stream which has been created by a sender; so embedding anything into it would, according to Shannon’s model, require the markup to be part of the transmitted message. This is indeed what SGML was created for: to record the intentions of the producer of a document about the formatting (and, up to a degree, the meaning) of a data stream in such a way that they are independent of the requirements of specific devices.

When we are not able to check the meaning of a message with the sender, we have to distinguish between the message itself, even if we do not understand it, and our assumptions about how to interpret it. As we do not know the intent of the sender, the result of the “transmission” of a historical text across time cannot be determined conclusively.

(ii) That data, as transmitted in signal streams, and information, as handled by humans, are not identical is a truism. They have long been seen as separate strata in information theory. (For a recent overview of the discussion see Rowley 2007.) A main difference between Shannon and the “data – information – knowledge – wisdom” hierarchy has always been that the former leads directly to an intuitive understanding of systems which can be realized by software engineering, while the latter does not. This is also true of attempts to use a similar scheme to understand information systems, notably Langefors’ (1995) infological equation.

(2) I_x = i(I_{x-α}, s(I_{x-β}, t), t)

Roughly: Information at point x is the result of interpreting an earlier level of information, in the light of knowledge generated from earlier knowledge, at a point of time t. As this allows the interpretation of data (e.g. a “transmission” from a sender who is no longer living) to be understood as a process which does not have to terminate, it is a better model for the handling of Humanities’ texts than Shannon’s.
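The non-terminating character of formalism (2) can be illustrated with a minimal Python sketch. It is an illustration only and not part of Langefors’ or Shannon’s work: the names interpretations, i and s, the representation of information as a list of strings, and the simplification α = β = 1 are all assumptions made for the example.

```python
from itertools import islice
from typing import Callable, Iterator, List

Info = List[str]  # assumption: one level of information is a list of statements


def interpretations(
    initial: Info,
    i: Callable[[Info, int, int], Info],
    s: Callable[[Info, int], int],
) -> Iterator[Info]:
    """Yield I_0, I_1, I_2, ... following I_x = i(I_{x-1}, s(I_{x-1}, t), t).

    Assumption: alpha = beta = 1, and t simply counts the steps.
    The generator never terminates on its own; the caller decides when to stop.
    """
    current, t = initial, 0
    while True:
        yield current
        t += 1
        current = i(current, s(current, t), t)


def s(info: Info, t: int) -> int:
    """Toy knowledge function: how many statements exist so far."""
    return len(info)


def i(info: Info, known: int, t: int) -> Info:
    """Toy interpretation: add one further reading, leaving earlier levels intact."""
    return info + [f"reading {known} (t={t})"]


# Take only the first three levels of an otherwise endless process.
first_three = list(islice(interpretations(["transmitted tokens"], i, s), 3))
```

Each level is built from the previous one without altering it, so earlier states of interpretation remain available for inspection.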

(iii) This abstract model can be turned into an architecture for a representation of information which can be processed by software. Thaller (2009b) led a project team within the digital preservation project PLANETS (cf. http://www.planets-project.eu/) which used this abstract model to develop tools that compare the information contained in two representations of an item in two different technical formats. (Roughly: Does a PDF document contain exactly the same “text” as a Word document?) For this purpose it is assumed that all information represented in persistent form on a computer consists of a set of tokens carrying information, which exists within an n-dimensional interpretative space, each dimension of that space describing one “meaning” to be assigned to the tokens. Such a meaning can be a request directed at the rendering system processing the data to render a byte sequence in a specific way, or a connection to a semantic label empowering an information retrieval system. As such a representation is fully recursive, the requirements of formalism (2) above are fulfilled. For texts this can be simplified to an introductory example, in which a text is seen as a chain of characters, each of which can be described by arbitrarily many orthogonal properties. (Whether the string “Biggin” within a text describes a person or an airfield is independent of whether that string is set in italics or not; whether the string “To be or not to be” is assigned to the speaker Hamlet is independent of whether it appears on page 13 or 367 of a book.)
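The introductory example can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the XCL implementation: the class Token, the helper functions, the sample sentence, and the dimension names "rendering" and "entity" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Token:
    """One character of the text plus its position in the interpretative space."""
    char: str
    # Each key is one dimension ("meaning"); dimensions are independent of each other.
    properties: Dict[str, Any] = field(default_factory=dict)


def make_text(s: str) -> List[Token]:
    """Turn a string into a chain of tokens without any interpretation attached."""
    return [Token(c) for c in s]


def annotate(text: List[Token], start: int, end: int, dimension: str, value: Any) -> None:
    """Assign a value in one dimension to the tokens text[start:end]."""
    for token in text[start:end]:
        token.properties[dimension] = value


def surface(text: List[Token]) -> str:
    """Recover the bare chain of characters, ignoring all interpretation."""
    return "".join(token.char for token in text)


text = make_text("from Biggin Hill")
annotate(text, 5, 11, "rendering", "italic")  # a request to the rendering system
annotate(text, 5, 11, "entity", "airfield")   # a semantic label for retrieval

# Revising the semantic reading leaves the rendering dimension untouched.
annotate(text, 5, 11, "entity", "person")
```

Because every dimension is stored under its own key, a change of opinion about what “Biggin” refers to neither affects the italics nor the characters themselves.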

(iv) Returning to the argument of section (i), we can see that there is a direct correspondence between the two arguments. On the one hand, there is the necessity to keep (a) the symbols transmitted within a “message” from a sender who is irrevocably in the past and (b) our intellectual interpretations of them cleanly and unmistakably separate. On the other hand, there is the necessity to distinguish clearly between (a) the tokens which transmit the data contained within a byte stream and (b) the technical information necessary to interpret that byte stream within a rendering system. If it is useful to transfer information transported in files of different formats into a representation in which the transmitted data are kept completely separate from the technical data needed to interpret them on a technical level, it is highly plausible that this is even more the case when we are discussing interpretations of texts left to us by authors we cannot consult any more.

This in turn is highly compatible with an architecture for virtual research environments for manuscript-related work, in which Humanities’ work on historical texts is understood to consist of adding layers of changing and potentially conflicting interpretation onto a set of images of the manuscript to be interpreted. Ebner et al. (2011) have recently described a virtual research environment for medieval manuscripts which implements this overall architecture, though using embedded markup for some of the layers for the time being.
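A minimal sketch of such layering, in Python and under stated assumptions: Region, Layer and readings are hypothetical names, the image identifier, coordinates and readings are invented for illustration, and the sketch is not the architecture described by Ebner et al. It only shows that conflicting interpretations can coexist as separate layers while the images themselves remain untouched.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Region:
    """A rectangular area on one manuscript image (frozen, so usable as a dict key)."""
    image: str  # hypothetical image identifier
    x: int
    y: int
    width: int
    height: int


@dataclass
class Layer:
    """One interpreter's statements about regions, kept apart from the images."""
    author: str
    kind: str
    claims: Dict[Region, str] = field(default_factory=dict)


def readings(layers: List[Layer], region: Region) -> List[Tuple[str, str]]:
    """Collect every statement made about a region, without resolving conflicts."""
    return [
        (layer.author, layer.claims[region])
        for layer in layers
        if region in layer.claims
    ]


word = Region("folio_1r.jpg", 120, 340, 80, 20)
layer_a = Layer("editor A", "transcription", {word: "Biggin"})
layer_b = Layer("editor B", "transcription", {word: "Higgin"})

# Both readings are retained side by side; neither overwrites the other.
all_readings = readings([layer_a, layer_b], word)
```

Adding, revising or withdrawing a layer changes only that layer, which is what allows interpretations to change over time without touching the transmitted source.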

To summarize the argument: (1) All texts for which we cannot consult the producer should be understood as a sequence of tokens, and we should keep the representation of the tokens and the representation of our interpretation of them completely separate. (2) Such representations can be grounded in information theory. (3) These representations are useful as blueprints for software on highly divergent levels of abstraction.

References

  • Ebner, Daniel; Graf, Jochen; Thaller, Manfred (2011): “A Virtual Research Environment for the handling of medieval charters”, paper presented at the conference Supporting Digital Humanities: Answering the unaskable, Copenhagen, November 17–18, 2011. (http://www.hki.uni-koeln.de/sites/all/files/VdUSDH11-02.pdf)
  • Langefors, Börje (1995): Essays on Infology, Lund: Studentlitteratur.
  • Rowley, Jennifer (2007): “The wisdom hierarchy: representations of the DIKW hierarchy”, in: Journal of Information Science 33 (2007), 163–180.
  • Shannon, Claude Elwood (1948): “A Mathematical Theory of Communication”, in: Bell System Technical Journal, 1948 (July, October), 379–423, 623–656.
  • Thaller, Manfred (2009a): “The Cologne Information Model: Representing Information Persistently”, in: Thaller 2009b, 223–240.
  • Thaller, Manfred (2009b): The eXtensible Characterisation Languages – XCL, Hamburg.