Now for something a little different. I mentioned before that we can conduct similar analyses on pieces of the plays rather than the plays as a whole. In this experiment, I have been working with 1000 word chunks of Shakespeare plays, which allows me to use many more variables in the analysis. (This was the technique that Hope and I used in our 2007 article on Tragicomedy.) Obviously the plays weren’t written to be read, much less analyzed, in identically sized pieces: the procedure is artificial through and through. It does allow us, however, to see things that Shakespeare does consistently throughout different genres, things that happen repeatedly throughout an entire play rather than just the beginning or end. Another caveat: we partitioned the plays starting at the beginning of each text, making the first 1000 words the first “piece.” This results in a loss of some of the playtext at the end, since any remainder that is less than 1000 words is dropped. In future analyses, we will take evenly spaced 1000 word samples from beginning to end, partitioning losses in between. There are no perfect answers here when it comes to dividing the plays into working units. So this is a first installment.
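For readers who want to see the partitioning concretely, here is a minimal sketch in Python. It is not the procedure we actually ran; the function names and the toy "play" are invented for illustration, and the second function is only one possible reading of what "evenly spaced" sampling might look like.

```python
def fixed_chunks(words, size=1000):
    """Consecutive pieces of `size` words from the start; a short remainder is dropped."""
    n_full = len(words) // size
    return [words[i * size:(i + 1) * size] for i in range(n_full)]


def evenly_spaced_chunks(words, size=1000):
    """Same number of full pieces, with the leftover words spread across the gaps between them.

    Assumption: this is one way to interpret "evenly spaced" samples, not a documented method.
    """
    n_full = len(words) // size
    if n_full == 0:
        return []
    if n_full == 1:
        return [words[:size]]
    leftover = len(words) - n_full * size
    gap = leftover / (n_full - 1)  # words skipped between neighboring pieces
    starts = [round(i * (size + gap)) for i in range(n_full)]
    return [words[s:s + size] for s in starts]


# Hypothetical play of 3450 "words" (just integers standing in for tokens).
toy_play = list(range(3450))
pieces = fixed_chunks(toy_play)
spaced = evenly_spaced_chunks(toy_play)
print(len(pieces), "pieces,", len(toy_play) - 1000 * len(pieces), "words dropped at the end")
print("evenly spaced pieces start at:", [p[0] for p in spaced])
```

With the first function the loss always falls at the end of the play; with the second, the same number of words goes unanalyzed, but the gaps are distributed between pieces so that the final lines of the play are sampled.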
The video above (press to play) is a three dimensional JMP plot of 767 pieces of Shakespeare in a dataspace of three scaled Principal Components (1, 4, and 9), which I have chosen based on their power to sort the plays in the Tukey Test. (See Tukey results for PCs 1 and 4.) When you run the video capture, you’ll see a series of dots that are color coded based on generic differences: red is comedy, green is history, blue is late plays and orange is tragedy. Early in the capture, I move an offscreen slider that creates a series of chromatic “halos” or ellipsoid bubbles around neighboring dots: these halos envelop dot groupings as they meet certain contiguity thresholds. You see the two major clusters I am interested in here, histories and comedies, forming in the lower left and upper right respectively. (Green on lower left, red on upper right.) Interestingly enough, the see-saw effect we saw in our analysis of entire plays is repeated here: comedies and histories are the most easily separated, because whenever Shakespeare is using strings associated with comedy, he can’t or won’t simultaneously use strings associated with history (and vice versa). Linguistic weight cannot be placed on both sides of this particular generic fulcrum at once.
Now the resulting encrusted object, which I have rotated in three dimensions, is a lot less elegant than the object we would be contemplating were we to do discriminant analysis of these groups. I am saving Discriminant Analysis for a later post. For all its imperfections, Principal Component Analysis is still going to give us some results or linguistic patterns we can make sense of, which is the ultimate measure of success here. I think it’s worth appreciating the spatial partitioning here in all of its messiness: the multicolored object presents both a pattern that we are familiar with (comedies and histories really do flock to opposite ends of the containing dataspace) and some jagged edges that show the imperfections of the analysis. Imperfections are good: we want to find exceptions to generic rules, not just confirmations of a pattern.
Looking at the upper right hand quadrant, we see the items that are high on both PC1 and PC4. In this analysis we are using Language Action Types or LATs, the finest grained categories that Docuscope uses (it has 101 of them). We will want to ask which specific LATs are pushing items into the different areas here, and to do so, I have produced the following loading biplot:
A loadings biplot gives information about components in spatial form, showing our different analytic categories (LATs such as “Common Authorities,” “DenyDisclaim,” “SelfDisclosure,” etc.) as red arrows or vectors. To read this diagram, consider the two components individually. What makes an item high on PC1? Since PC1 is rated on the horizontal axis, we scan left to right for the vectors or arrows that are at the extremes. To my eye, SelfDisclosure, FirstPer[son] and DirectAddress are the most strongly “loaded” on this component, which means that any piece that has a relatively high score on these variables will be favored by this component and thus pushed to the right hand side of a scatterplot (see below). Conversely, any item that is relatively high in the words that fall under categories such as Motions, SenseProperty, SenseObject, and Inclusive will be pushed to the left. Notice that the two variables SelfDisclosure and SenseObject are almost directly opposed: the loadings biplot is telling us here that, statistically at least, the use of this one type of word (or string of words) seems to preclude the use of its opposite. This would be true of all the longer vector arrows in the diagram that extend from opposite sides of the origin.
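To make the idea of opposed loadings less abstract, here is a small sketch of how the weights on a first component fall out of a correlation matrix. It uses only the Python standard library and power iteration rather than JMP, the frequencies are invented, and the three column labels are borrowed from the biplot purely as names; none of this reproduces the actual analysis. Note also that JMP's loadings are scaled versions of these weights, so only the signs and relative sizes are meant to be compared.

```python
import math


def correlation_matrix(rows):
    """Pearson correlations between the columns of a list of equal-length rows."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) for j in range(k)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(k)] for r in rows]
    return [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(k)] for a in range(k)]


def first_component(corr, iterations=500):
    """Unit-length leading eigenvector of the correlation matrix, found by power iteration."""
    k = len(corr)
    v = [1.0 / (j + 1) for j in range(k)]  # arbitrary, non-uniform starting vector
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Invented per-piece frequencies for six hypothetical pieces (not real Docuscope output).
labels = ["SelfDisclosure", "SenseObject", "DirectAddress"]
toy_rows = [
    [5.0, 1.0, 4.0],
    [4.5, 1.5, 3.5],
    [4.0, 2.0, 3.8],
    [2.0, 4.0, 1.5],
    [1.5, 4.5, 2.0],
    [1.0, 5.0, 1.0],
]

weights = first_component(correlation_matrix(toy_rows))
for label, weight in zip(labels, weights):
    print(f"{label:15s} {weight:+.3f}")
```

In this toy data the first two columns rise and fall against one another, so their weights come out with opposite signs, which is what a pair of arrows pointing in opposite directions on the biplot is reporting. The overall sign of a component is arbitrary; what matters is which variables share a sign and which oppose each other.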
We can then do the same thing with the vertical axis, which represents PC4. Here we see that LangRef [Language Reference], DenyDisclaim and Uncertainty strings are used in opposition to those classed under the LAT Common Authority. If an item scores high on PC4 (which most comedies do), it will be high in LangRef, Uncertainty and DenyDisclaim strings while simultaneously lacking Common Authority strings. So what about the vectors that bisect the axes, for example, DenyDisclaim, which appears to load positively on both PC1 and PC4? This LAT is shared by the two components: it does something for both. We can learn a lot by looking at this diagram, since, once we’ve decided that these components track a viable historical or critical distinction among texts, it shows us certain types of language “schooling together” in the process of making this distinction. DirectAddress and FirstPer [or, First Person], Autobio and Acknowledge thus tend to go together here (lower right), as do Motions, SenseProperties, and Sense Objects (upper left).
In fact, the designer of Docuscope saw these LATs as being related, which is why elsewhere he aggregated them together into larger “buckets” such as Dimensions or Clusters, the latter being the aggregate we used in our analysis of full plays. What we’re seeing here is a kind of “schooling of like LATs in the wild,” where words that are grouped together on theoretical grounds are associating with one another statistically in a group of texts. If the intellectual architecture of Docuscope’s categories is good, this schooling should happen with almost any biplot of components, no matter what types of texts they discriminate. The power of this combination of Principal Components, then, is that it aligns the filiations and exclusions of the underlying language architecture with genres that we recognize, and will hopefully suggest theatrical or narrative strategies that support these recognizable divisions.
The loadings biplot shows us how the variables in our analysis are pushing items in the corpus into different regions of a dataspace. We can now populate that dataspace with the 767 pieces of Shakespeare’s plays, rating each of them on the two components. Here is how the plays appear in a plot of scaled Component 1 against Component 4, again color coded with the scheme used above:
Notice the pattern we’ve seen before: comedies (here represented in red) are opposite histories (green) in diagonal quadrants. In general, they don’t mingle. The upper right hand quadrant, which is where the comedies tend to locate, contains the first item that I’d like to discuss: the red dot labelled Merry Wives (circa 2.1). This dot represents a piece of the first scene, second act of The Merry Wives of Windsor. As the item that rates highest on both PC1 and PC4 — components which the Tukey Test shows us to be best at discriminating comedy — this piece of The Merry Wives of Windsor is the most comic 1000 word passage that Shakespeare wrote. Here is an excerpt:
“I’ll entertain myself like one that I am not acquainted withal; for, sure, unless he know some strain in me, that I know not myself, he would never have boarded me in this fury.” In this color coded sentence we can see diagrammed the comic dance step. While I think there are funnier lines (“I had rather be a giantess, and lie under Mount Pelion”), this one is significant for what it does linguistically: it shows a speaker entertaining and then rejecting a perspective on her own situation (that of Falstaff) while comparing it with another (her own). The uncertainty strings (orange) such as “know not,” “doubt” and the indefinite “some” contribute to this mock searching rhetoric. Self-disclosure strings such as “myself” and “makes me” anchor the reality testing exercise to the speaker, who must make explicit her own place in the sentence as the object of doubt, while the oppositional reasoning strings such as “never” and “not” mark the mobility of this speaker’s perspective: I will try this toying perspective on my honesty, seeing myself as Jack Falstaff does, but will reject it soon enough. The reason that this passage is so highly rated on these two factors has something to do with the multiplication of perspectives that are being juggled onstage: there are two individuals here, Mistress Page and Mistress Ford, who are, as it were, rising above an embedded perspective contained in Falstaff’s letter, commenting upon that perspective, and then rejecting it. Each time a partition in reality (a level) is broached in the stage action and dialogue, comic language appears.
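The color coding described above can be imitated, very crudely, with a toy tagger. The sketch below is an illustration only: the mini-dictionary contains just a handful of the strings named in this post, the category labels are my shorthand, and the longest-match rule is an assumption about how overlapping strings might be resolved. It is not Docuscope's dictionary or its matching procedure.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary built only from strings mentioned in the post.
TOY_DICTIONARY = {
    ("know", "not"): "Uncertainty",
    ("some",): "Uncertainty",
    ("myself",): "SelfDisclosure",
    ("never",): "OppositionalReasoning",
    ("not",): "OppositionalReasoning",
}


def tag_counts(text, dictionary=TOY_DICTIONARY):
    """Count category hits, preferring the longest dictionary string at each position."""
    tokens = re.findall(r"[a-z']+", text.lower())
    longest = max(len(key) for key in dictionary)
    counts = Counter()
    i = 0
    while i < len(tokens):
        for span in range(longest, 0, -1):
            key = tuple(tokens[i:i + span])
            if key in dictionary:
                counts[dictionary[key]] += 1
                i += len(key)  # skip past the matched string so it is counted once
                break
        else:
            i += 1  # no dictionary string starts at this token
    return counts


sentence = (
    "I'll entertain myself like one that I am not acquainted withal; for, sure, "
    "unless he know some strain in me, that I know not myself, he would never "
    "have boarded me in this fury."
)
counts = tag_counts(sentence)
print(dict(counts))
```

Because “know not” is matched as a unit before the bare “not” is considered, the second “not” in the sentence is credited to uncertainty rather than to oppositional reasoning. Tallies of this kind, taken over every 1000 word piece, are the raw material that the components are computed from.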
We can oppose this most comic piece of writing — again, according to PCA — to its opposite in linguistic terms, a piece that contains what the comic one lacks and lacks what the comic one has. Here, then, is a portion of the “most historical” piece of Shakespeare, from Richard II 1.3:
Here we see the formal settings of royal display, a herald offering Mowbray’s formal challenge. It is no surprise that this exemplifies history, a genre in which the nation and its kings are front and center. Yet where the passage really begins to rack up points is in its use of descriptive words, which are underlined in yellow. Chairs, helmets, blood, earth, gentle sleep, drums, quiet confines…we don’t think of history as the genre of objects and adjectives, but linguistically it is. Inclusive strings, in the olive colored green, are perhaps less surprising given our previous analyses. We expect kings to speak about “our council” and what “we have done.” But notice that such language is quite difficult to use in comedy: even in a passage of collusion, where we would expect Mistress Page and Mistress Ford to be using first person plural pronouns, the language tends to pivot off of first person singular perspectives. The language of “we” really isn’t a part of comedy.
I am less surprised to find, at this finer grained level of analysis, words from official life (what Docuscope tags as Commonplace Authority, in bright green) associated with history, since these are context specific. More interesting is the presence of the purple words, which Docuscope tags as person properties. These are high in history, but show up in comedy as well, as you can see on the loading biplot above. This marked up passage is also useful because it shows us something we’d want to disagree with: you don’t have to be Saul Kripke to see that a proper name like Henry is an imperfect designator of persons, particularly because other proper names such as Richard do not get counted under this category by Docuscope. We live with the imperfections, unless it appears that there are so many mentions of the name Henry in the plays that this entire LAT category must be discounted.

Platonic Dialogues and the “Two Socrates”
Video: Vlastos (1991) Groupings, PCA on Correlations
I have been thinking for a while now that Docuscope preserves, in its tagging structure, what a translator preserves — that this is a good definition of what it is looking to classify. One way to test this hypothesis would be to try Docuscope on a set of translations, which is what I’ve tried to do here.
The visualization above (press to rotate) shows the Platonic Corpus as translated by the nineteenth-century classicist Benjamin Jowett, rated by principal components on correlations and color coded according to the division proposed by the great Plato scholar Gregory Vlastos (1991), who sorted the dialogues into early (red), middle (blue), and late (green). (The semitransparent ellipsoids are drawn to capture 50 percent of the items in the group.) Vlastos argued, on the basis of the types of arguments used in these texts, that the early dialogues represent a distinct group from those produced in the middle or later periods. The mode of argument in these earlier dialogues, he observes, is elenctic or adversative, which means that in these dialogues Socrates does not “defend a thesis of his own” but rather examines one held by an interlocutor (113). Socrates thus avoids making knowledge claims in these dialogues, instead forcing his interlocutors to enunciate them as the weakness of their own positions becomes apparent. Believing that there are two “Socrates” presented in these dialogues, Vlastos argues that the early Socrates, who likely represents the philosophical position of the historical Socrates rather than Plato, must rely on the “’say what you believe’ rule” (113), this rule supplying the rough materials of his proofs. As epistemologist (which he is not in these dialogues), Socrates does not advance certain knowledge claims: the elenctic method will not support them.
The middle and later Socrates, by contrast, is fully willing to advance certain knowledge claims, which he seeks to present demonstratively (48). Rather than being simply a moral philosopher, he is now a “moral philosopher and metaphysician and epistemologist and philosopher of science and philosopher of language and philosopher of religion and philosopher of education and philosopher of art.” In these dialogues, Socrates advances a theory of knowledge as the recollection of separately existing Forms – a significant epistemological leap. This Socrates is now a spokesman for Plato, making the most important division of the corpus that between the early dialogues and all the rest.
Taking this division as a starting point, let’s look at how Docuscope divides the dialogues, which it does here simply on the basis of mean scores on all 101 of the Language Action Types. These scores are plotted in a hyperspace and then the least dissimilar items are paired using Ward’s method on unscaled data. The technique is the same as the one that produced the most effective genre clustering of Shakespeare’s plays. I am thus using what I know of a particular mathematical technique as it applies to historically accepted clusterings of Shakespeare’s plays and applying it to a body of works that is less familiar to me – not quite what Franco Moretti calls “the great unread,” but definitely a case of trying to understand the lesser known through the better known.
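For those unfamiliar with Ward's method, the sketch below shows the idea in miniature: at each step, merge the two groups whose union adds the least to the total within-group spread. This is a plain Python illustration under stated assumptions, not the statistical package's routine I used; the six texts and their two-dimensional "mean scores" are invented, whereas the real analysis works with 101 LAT scores per dialogue.

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b were merged."""
    na, nb = len(a["members"]), len(b["members"])
    dist2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return na * nb / (na + nb) * dist2


def ward_cluster(points, n_clusters):
    """Greedy agglomerative clustering with Ward's criterion on unscaled data.

    `points` maps a label to a vector of scores; returns lists of labels.
    """
    clusters = [{"members": [label], "centroid": list(vec)} for label, vec in points.items()]
    while len(clusters) > n_clusters:
        # Find the pair of clusters that is cheapest to merge.
        pairs = [
            (ward_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        _, i, j = min(pairs)
        a, b = clusters[i], clusters[j]
        na, nb = len(a["members"]), len(b["members"])
        merged = {
            "members": a["members"] + b["members"],
            "centroid": [(na * x + nb * y) / (na + nb) for x, y in zip(a["centroid"], b["centroid"])],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c["members"]) for c in clusters]


# Invented scores for six hypothetical texts (placeholders, not Plato data).
toy_scores = {
    "text_A": [1.0, 0.2],
    "text_B": [1.2, 0.1],
    "text_C": [0.9, 0.3],
    "text_D": [4.0, 3.1],
    "text_E": [4.2, 2.9],
    "text_F": [3.8, 3.0],
}
groups = ward_cluster(toy_scores, n_clusters=2)
print(sorted(groups))
```

Stopping at two clusters is a choice made here for the sake of the example; a dendrogram records every merge along the way, so the reader can decide afterwards where to cut the tree.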
Ward’s Clustering on Translated Plato Dialogues
As you can see from the clustering of red or early period dialogues above, we can arrive at an arrangement of the dialogues using Docuscope data that is remarkably similar to the basic division in the dialogues that Vlastos argued for in 1991. But what is perhaps most interesting is that roughly the same division was arrived at stylometrically in the late nineteenth century, and that there has been at least some convergence within Plato studies of what we might call “intensive” techniques for sorting the dialogues (based on reactions of readers to the doctrines or manner of presentation) and “extensive” ones (built on groups that themselves represent the capture of stylometrically significant counted items). As Brandwood shows in The Chronology of Plato’s Dialogues (1990), it was already apparent to computationally unassisted readers of Plato such as L. Campbell that the later dialogues exhibited more technical and rare words, as well as a “peculiar, stately rhythm.” These claims were advanced with quantitative evidence (Campbell, 1867) but were grounded in an impression gathered through close and repeated reading. This line of inquiry was also taken up by the German classicist W. Dittenberger, who in 1896 argued that early and later dialogues could be discriminated by looking at the particles καὶ μήν, ἀλλὰ μήν, which co-occur in the early dialogues, and τί μήν, ἀλλὰ…μήν, and γε μήν, which co-occur in the later ones. This essentially multivariate pattern yielded the early grouping: Crito, Euthyphro, Protagoras, Charmides, Laches, Euthydemus, Meno, Gorgias, Cratylus, Phaedo. As you can see from the above, Vlastos’ groupings and those of Dittenberger overlap significantly. To this we might add the groupings derived from the Docuscope codings.
This convergence is interesting for a number of reasons. First, it shows us extensive and intensive techniques working in tandem, which raises the basic question of how these two things are related. Second, it shows us how a certain conversational style or dialogical setting connects with a philosophical position, and how both may themselves become available for analysis through the counting of seemingly inconsequential particles such as μήν. The Platonic corpus is an excellent one to work with because it has been well studied, and we have the advantage of pre-computational techniques to examine alongside actual readers’ responses. In my next post, I will examine those features in the translated dialogues that – once tagged by Docuscope – seem to be doing a good job of reproducing the scholarly divisions described above.