Platonic Dialogues and the “Two Socrates”

Press to Start: Vlastos (1991) Groupings, PCA on Correlations

I have been thinking for a while now that Docuscope preserves, in its tagging structure, what a translator preserves — that this is a good definition of what it is looking to classify. One way to test this hypothesis would be to try Docuscope on a set of translations, which is what I’ve tried to do here.

The visualization above (press to rotate) shows the Platonic Corpus as translated by the nineteenth-century classicist Benjamin Jowett, rated by principal components on correlations and color coded by the divisions proposed by the great Plato scholar Gregory Vlastos (1991), whose division of the dialogues into early (red), middle (blue), and late (green) is highlighted here. (The semitransparent ellipsoids are drawn to capture 50 percent of the items in the group.) Vlastos argued, on the basis of the types of arguments used in these texts, that the early dialogues represent a distinct group from those produced in the middle or later periods. The mode of argument in these earlier dialogues, he observes, is elenctic or adversative, which means that in these dialogues Socrates does not “defend a thesis of his own” but rather examines one held by an interlocutor (113). Socrates thus avoids making knowledge claims in these dialogues, instead forcing his interlocutors to enunciate them as the weakness of their own positions becomes apparent. Believing that there are two “Socrates” presented in these dialogues, Vlastos argues that the early Socrates (who likely represents the philosophical position of the historical Socrates rather than Plato) must rely on the “‘say what you believe’ rule” (113), this rule supplying the rough materials of his proofs. As epistemologist (which he is not in these dialogues), Socrates does not advance certain knowledge claims: the elenctic method will not support them.

The middle and later Socrates, by contrast, is fully willing to advance certain knowledge claims, which he seeks to present demonstratively (48). Rather than being simply a moral philosopher, he is now a “moral philosopher and metaphysician and epistemologist and philosopher of science and philosopher of language and philosopher of religion and philosopher of education and philosopher of art.” In these dialogues, Socrates advances a theory of knowledge as the recollection of separately existing Forms, a significant epistemological leap. This Socrates is now a spokesman for Plato, making the most important division of the corpus that between the early dialogues and all the rest.

Taking this division as a starting point, let’s look at how Docuscope divides the dialogues, which it does here simply on the basis of mean scores on all 101 of the Language Action Types. These scores are plotted in a hyperspace and then the least dissimilar items are paired using Ward’s method on unscaled data. The technique is the same as the one that produced the most effective genre clustering of Shakespeare’s plays. I am thus using what I know of a particular mathematical technique as it applies to historically accepted clusterings of Shakespeare’s plays and applying it to a body of works that is less familiar to me – not quite what Franco Moretti calls “the great unread,” but definitely a case of trying to understand the lesser known through the better known.

Ward’s Clustering on Translated Plato Dialogues

As you can see from the clustering of red or early period dialogues above, we can arrive at an arrangement of the dialogues using Docuscope data that is remarkably similar to the basic division in the dialogues that Vlastos argued for in 1991. But what is perhaps most interesting is that roughly the same division was arrived at stylometrically in the late nineteenth century, and that there has been at least some convergence within Plato studies of what we might call “intensive” techniques for sorting the dialogues (based on reactions of readers to the doctrines or manner of presentation) and “extensive” ones (built on groups that themselves represent the capture of stylometrically significant counted items). As Brandwood shows in The Chronology of Plato’s Dialogues (1990), it was already apparent to computationally unassisted readers of Plato such as L. Campbell that the later dialogues exhibited more technical and rare words, as well as a “peculiar, stately rhythm.” These claims were advanced with quantitative evidence (Campbell, 1867) but were grounded in an impression gathered through close and repeated reading. This line of inquiry was also taken up by the German classicist W. Dittenberger, who in 1896 argued that early and later dialogues could be discriminated by looking at the particles καὶ μήν and ἀλλὰ μήν, which co-occur in the early dialogues, and τί μήν, ἀλλὰ…μήν, and γε μήν, which co-occur in the later ones. This essentially multivariate pattern yielded the early grouping: Crito, Euthyphro, Protagoras, Charmides, Laches, Euthydemus, Meno, Gorgias, Cratylus, Phaedo. As you can see from the above, Vlastos’ groupings and those of Dittenberger overlap significantly. To this we might add the groupings derived from the Docuscope codings.

This convergence is interesting for a number of reasons. First, it shows us extensive and intensive techniques working in tandem, which raises the basic question of how these two things are related. Second, it shows us how a certain conversational style or dialogical setting connects with a philosophical position, and how both may themselves become available for analysis through the counting of seemingly inconsequential particles such as μήν. The Platonic corpus is an excellent one to work with because it has been well studied, and we have the advantage of pre-computational techniques to examine alongside actual readers’ responses. In my next post, I will examine those features in the translated dialogues that, once tagged by Docuscope, seem to be doing a good job of reproducing the scholarly divisions described above.


The Funniest Thing Shakespeare Wrote? 767 Pieces of the Plays

Press to Play: 767 Pieces of Shakespeare in Scaled PCA Space

Now for something a little different. I mentioned before that we can conduct similar analyses on pieces of the plays rather than the plays as a whole. In this experiment, I have been working with 1000 word chunks of Shakespeare plays, which allows me to use many more variables in the analysis. (This was the technique that Hope and I used in our 2007 article on Tragicomedy.) Obviously the plays weren’t written to be read, much less analyzed, in identically sized pieces: the procedure is artificial through and through. It does allow us, however, to see things that Shakespeare does consistently throughout different genres, things that happen repeatedly throughout an entire play rather than just the beginning or end. Another caveat: we partitioned the plays starting at the beginning of each text, making the first 1000 words the first “piece.” This results in a loss of some of the playtext at the end, since any remainder that is less than 1000 words is dropped. In future analyses, we will take evenly spaced 1000 word samples from beginning to end, partitioning losses in between. There are no perfect answers here when it comes to dividing the plays into working units. So this is a first installment.
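The partitioning described above can be sketched in a few lines. This is only an illustration of the procedure, not the code we used: it assumes plain whitespace tokenization (Docuscope's own tokenizing differs), and the function name `chunk_words` and the toy text are hypothetical.

```python
def chunk_words(text, size=1000):
    """Split text into consecutive pieces of exactly `size` words.

    Partitioning starts at the first word; a trailing remainder shorter
    than `size` is dropped, as in the procedure described in the post.
    """
    words = text.split()  # assumption: whitespace tokenization
    n_full = len(words) // size
    return [words[i * size:(i + 1) * size] for i in range(n_full)]


# Hypothetical toy "play" of 2,350 words.
toy_play = " ".join(f"w{i}" for i in range(2350))
pieces = chunk_words(toy_play, size=1000)
dropped = 2350 - sum(len(p) for p in pieces)
print(len(pieces), "pieces;", dropped, "words dropped at the end")
```

Starting every piece at the first word keeps the procedure simple, at the cost of always losing text from the end of a play rather than spreading the loss evenly.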

The video above (press to play) is a three dimensional JMP plot of 767 pieces of Shakespeare in a dataspace of three scaled Principal Components (1, 4, and 9) which I have chosen based on their power to sort the plays in the Tukey Test. (See Tukey results for PCs 1 and 4.) When you run the video capture, you’ll see a series of dots that are color coded based on generic differences: red is comedy, green history, blue is late plays and orange tragedies. Early in the capture, I move an offscreen slider that creates a series of chromatic “halos” or ellipsoid bubbles around neighboring dots: these halos envelop dot groupings as they meet certain contiguity thresholds. You see the two major clusters I am interested in here, histories and comedies, forming in the lower left and upper right respectively. (Green on lower left, red on upper right.) Interestingly enough, the see-saw effect we saw in our analysis of entire plays is repeated here: comedies and histories are the most easily separated, because whenever Shakespeare is using strings associated with comedy, he can’t or won’t simultaneously use strings associated with history (and vice versa). Linguistic weight cannot be placed on both sides of this particular generic fulcrum at once.

Now the resulting encrusted object, which I have rotated in three dimensions, is a lot less elegant than the object we would be contemplating were we to do discriminant analysis of these groups. I am saving Discriminant Analysis for a later post. For all its imperfections, Principal Component Analysis is still going to give us some results or linguistic patterns we can make sense of, which is the ultimate measure of success here. I think it’s worth appreciating the spatial partitioning here in all of its messiness: the multicolored object presents both a pattern that we are familiar with (comedies and histories really do flock to opposite ends of the containing dataspace) and some jagged edges that show the imperfections of the analysis. Imperfections are good: we want to find exceptions to generic rules, not just confirmations of a pattern.

Looking at the upper right hand quadrant, we see the items that are high on both PC1 and PC4. In this analysis we are using Language Action Types or LATs, the finest grained categories that Docuscope uses (it has 101 of them). We will want to ask which specific LATs are pushing items into the different areas here, and to do so, I have produced the following loadings biplot:

A loadings biplot gives information about components in spatial form, showing our different analytic categories (LATs such as “Common Authority,” “DenyDisclaim,” “SelfDisclosure,” etc.) as red arrows or vectors. To read this diagram, consider the two components individually. What makes an item high on PC1? Since PC1 is rated on the horizontal axis, we scan left to right for the vectors or arrows that are at the extremes. To my eye, SelfDisclosure, FirstPer[son] and DirectAddress are the most strongly “loaded” on this component, which means that any piece that has a relatively high score on these variables will be favored by this component and thus pushed to the right-hand side of a scatterplot (see below). Conversely, any item that is relatively high in the words that fall under categories such as Motions, SenseProperty, SenseObject, and Inclusive will be pushed to the left. Notice that the two variables SelfDisclosure and SenseObject are almost directly opposed: the loadings biplot is telling us here that, statistically at least, the use of this one type of word (or string of words) seems to preclude the use of its opposite. This would be true of all the longer vector arrows in the diagram that extend from opposite sides of the origin.
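One way to see what a loading does is to compute a component score by hand: an item's position on a component is the sum of its centered scores on each variable, weighted by that variable's loading. The sketch below is only an illustration. The loadings, corpus means and the two pieces are invented numbers whose signs mimic the arrows described above; they are not values from the actual analysis.

```python
def pc_score(values, means, loadings):
    """Score of one item on one component: sum of loading * centered value."""
    return sum(loadings[k] * (values[k] - means[k]) for k in loadings)


# Hypothetical PC1 loadings: positive values are arrows pointing right,
# negative values are arrows pointing left.
pc1_loadings = {"SelfDisclosure": 0.6, "FirstPer": 0.5,
                "SenseObject": -0.5, "Motions": -0.4}

# Hypothetical corpus means (percent of tokens) for the same LATs.
means = {"SelfDisclosure": 2.0, "FirstPer": 3.0,
         "SenseObject": 2.0, "Motions": 1.0}

# Two invented 1000-word pieces.
comic_piece = {"SelfDisclosure": 3.0, "FirstPer": 4.0,
               "SenseObject": 1.0, "Motions": 0.5}
history_piece = {"SelfDisclosure": 1.0, "FirstPer": 2.0,
                 "SenseObject": 3.5, "Motions": 1.5}

comic_score = pc_score(comic_piece, means, pc1_loadings)
history_score = pc_score(history_piece, means, pc1_loadings)
print(comic_score)    # positive: plotted toward the right
print(history_score)  # negative: plotted toward the left
```

A piece lands far to the right when it is above average on the positively loaded LATs and below average on the negatively loaded ones at the same time, which is why opposed arrows describe an either/or pattern in the language.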

We can then do the same thing with the vertical axis, which represents PC4. Here we see that LangRef [Language Reference], DenyDisclaim and Uncertainty strings are used in opposition to those classed under the LAT Common Authority. If an item scores high on PC4 (which most comedies do), it will be high in LangRef, Uncertainty and DenyDisclaim strings while simultaneously lacking Common Authority strings. So what about the vectors that bisect the axes, for example, DenyDisclaim, which appears to load positively on both PC1 and PC4? This LAT is shared by the two components: it does something for both. We can learn a lot by looking at this diagram, since, once we’ve decided that these components track a viable historical or critical distinction among texts, it shows us certain types of language “schooling together” in the process of making this distinction. DirectAddress and FirstPer [or, First Person], Autobio and Acknowledge thus tend to go together here (lower right), as do Motions, SenseProperty, and SenseObject (upper left).

In fact, the designer of Docuscope saw these LATs as being related, which is why elsewhere he aggregated them together into larger “buckets” such as Dimensions or Clusters, the latter being the aggregate we used in our analysis of full plays. What we’re seeing here is a kind of “schooling of like LATs in the wild,” where words that are grouped together on theoretical grounds are associating with one another statistically in a group of texts. If the intellectual architecture of Docuscope’s categories is good, this schooling should happen with almost any biplot of components, no matter what types of texts they discriminate. The power of this combination of Principal Components, then, is that it aligns the filiations and exclusions of the underlying language architecture with genres that we recognize, and will hopefully suggest theatrical or narrative strategies that support these recognizable divisions.

The loadings biplot shows us how the variables in our analysis are pushing items in the corpus into different regions of a dataspace. We can now populate that dataspace with the 767 pieces of Shakespeare’s plays, rating each of them on the two components. Here is how the pieces appear in a plot of scaled Component 1 against Component 4, again color coded with the scheme used above:

Notice the pattern we’ve seen before: comedies (here represented in red) are opposite histories (green) in diagonal quadrants. In general, they don’t mingle. The upper right hand quadrant, which is where the comedies tend to locate, contains the first item that I’d like to discuss: the red dot labelled Merry Wives (circa 2.1). This dot represents a piece of the first scene, second act of The Merry Wives of Windsor. As the item that rates highest on both PC1 and PC4 — components which the Tukey Test shows us to be best at discriminating comedy — this piece of The Merry Wives of Windsor is the most comic 1000 word passage that Shakespeare wrote. Here is an excerpt:

“I’ll entertain myself like one that I am not acquainted withal; for, sure, unless he know some strain in me, that I know not myself, he would never have boarded me in this fury.” In this color coded sentence we can see diagrammed the comic dance step. While I think there are funnier lines (“I had rather be a giantess, and lie under Mount Pelion”), the former is significant for what it does linguistically: it shows a speaker entertaining and then rejecting a perspective on her own situation (that of Falstaff) while comparing it with another (her own). The uncertainty strings (orange) such as “know not,” “doubt” and the indefinite “some” contribute to this mock searching rhetoric. Self-disclosure strings such as “myself” and “makes me” anchor the reality testing exercise to the speaker, who must make explicit her own place in the sentence as the object of doubt, while the oppositional reasoning strings such as “never” and “not” mark the mobility of this speaker’s perspective: I will try this toying perspective on my honesty, seeing myself as Jack Falstaff does, but will reject it soon enough. The reason that this passage is so highly rated on these two factors has something to do with the multiplication of perspectives that are being juggled onstage: there are two individuals here, Mistress Page and Mistress Ford, who are, as it were, rising above an embedded perspective contained in Falstaff’s letter, commenting upon that perspective, and then rejecting it. Each time a partition in reality (a level) is broached in the stage action and dialogue, comic language appears.

We can oppose this most comic piece of writing — again, according to PCA — to its opposite in linguistic terms, a piece that contains what the comic one lacks and lacks what the comic one has. Here, then, is a portion of the “most historical” piece of Shakespeare, from Richard II 1.3:

Here we see the formal settings of royal display, a herald offering Mowbray’s formal challenge. It is no surprise that this exemplifies history, a genre in which the nation and its kings are front and center. Yet where the passage really begins to rack up points is in its use of descriptive words, which are underlined in yellow. Chairs, helmets, blood, earth, gentle sleep, drums, quiet confines…we don’t think of history as the genre of objects and adjectives, but linguistically it is. Inclusive strings, in olive green, are perhaps less surprising given our previous analyses. We expect kings to speak about “our council” and what “we have done.” But notice that such language is quite difficult to use in comedy: even in a passage of collusion, where we would expect Mistress Page and Mistress Ford to be using first person plural pronouns, the language tends to pivot off of first person singular perspectives. The language of “we” really isn’t a part of comedy.

I am less surprised to find, at this finer grained level of analysis, words from official life (what Docuscope tags as Common Authority, in bright green) associated with history, since these are context specific. More interesting is the presence of the purple words, which Docuscope tags as person properties. These are high in history, but show up in comedy as well, as you can see on the loadings biplot above. This marked up passage is also useful because it shows us something we’d want to disagree with: you don’t have to be Saul Kripke to see that a proper name like Henry is an imperfect designator of persons, particularly because other proper names such as Richard do not get counted under this category by Docuscope. We live with the imperfections, unless it appears that there are so many mentions of the name Henry in the plays that this entire LAT category must be discounted.


Clustering the Plays Without Principal Components

Folio plays clustered using all Language Action Types, Non-Standardized Data

In comparison to the previous post, where we were using the plays’ scores on Principal Components to create clusters, here we are just using the percentage counts of the plays on all of the Language Action Types, the lowest level of aggregation in Docuscope’s taxonomy of words or strings of language. There are 101 Language Action Types or LATs, which is to say, buckets of words or strings of words that David Kaufer has classified as doing a certain kind of linguistic or rhetorical work in a text. I have made a table of examples of these types, taken from the George Eliot novel Middlemarch, which can be downloaded here.

I find this diagram more than a little unnerving. It is quite accurate in terms of received genre judgments — notice that almost all of the Folio history plays (in green) are correct — and there are nice clusterings of both tragedies (tan) and comedies (red). Henry VIII, which is here identified as a late play (blue), is placed in the cluster full of other late plays (including Coriolanus, which could just as easily have been coded blue). And plays with a similar tone — Titus, Lear, and Timon — are all grouped together as tragedies, separate from the other tragedies that are placed together further above. The strange pairing that repeats here from the Principal Component clusterings is Tempest plus Romeo and Juliet, something which merits further inquiry.

Why should a mechanical algorithm looking at distances between counts of things produce a diagram this accurate? I’m not really sure. The procedure involves arraying each of the 36 plays in a multidimensional space depending on its percentage score on each of the things being rated here — the LAT categories. So, if “Motion” strings are one category, you can imagine an X axis with the scores of all the plays on “Motion,” with a Y axis rating all the plays on “Direct Address” as below:

Direct Address and Motion Scores in two Dimensions

Now think about adding another score — First Person — to the third dimension, which will give us a spatial distribution of the plays and their scores on each of these three LATs:

Direct Address, First Person and Motion Scores of Folio Plays

Now, there are distances between all of the points here and various methods (single linkage, complete linkage, Ward’s) for expressing the degree to which items arranged in such a space can be grouped together in a hierarchy of filiation or likeness. If you multiply out all of the things being scored in this analysis — that is, all 101 Language Action Types — you end up with a multidimensional space that is unvisualizable. But there are still distances among items in this multidimensional space, distances that can be placed into the algorithms for producing the hierarchy of likeness. That is what is going on — using Ward’s procedure with non-standardized data — in the visualization at the beginning of this post.
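The idea of distance in this space can be made concrete with a small sketch. Assume, purely for illustration, four plays scored on just three LATs; the play labels and percentages below are invented, and plain Euclidean distance is used. With 101 LATs the only thing that changes is the length of each list.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two items in LAT space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical percentage scores on (Motion, Direct Address, First Person).
scores = {
    "Play A": [1.2, 2.5, 4.0],
    "Play B": [1.3, 2.4, 4.1],
    "Play C": [2.8, 1.0, 2.2],
    "Play D": [2.9, 1.1, 2.0],
}

# One distance for every pair of plays.
distances = {
    (p, q): euclidean(scores[p], scores[q])
    for p, q in combinations(scores, 2)
}

closest = min(distances, key=distances.get)
print("Most similar pair:", closest)
```

A table of pairwise distances like this one is the raw material for a linkage method such as single linkage, complete linkage or Ward's, which then decides the order in which items and groups are joined.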

As I’ve said before, a picture is nice, but just because you can reproduce a human classification with an algorithm doesn’t mean you’ve made any progress. You have to be able to show what’s going on in a text — which words are doing what things some or most of the time — before you can call your work an analysis. Perhaps that’s another reason why I find a diagram like this unnerving: I cannot work back from it to a passage in a text.

By standardizing the data, we get the following re-arrangement. I am unsure how to categorize the benefits of data standardization in this case, but think this is a comparatively less compelling diagram:

Clustering of Folio Plays using Standardized Data
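For readers unsure what standardization changes, here is a minimal sketch using invented numbers. It assumes standardization means converting each LAT to z-scores (subtract the mean, divide by the standard deviation), which is the usual meaning but is an assumption about what the software does.

```python
import statistics


def standardize(column):
    """Convert one variable's scores to z-scores (mean 0, standard deviation 1)."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation
    return [(x - mean) / sd for x in column]


# Hypothetical percentage scores for five plays on a frequent LAT and a rare one.
frequent_lat = [10.0, 12.0, 14.0, 16.0, 18.0]
rare_lat = [0.10, 0.30, 0.20, 0.50, 0.40]

z_frequent = standardize(frequent_lat)
z_rare = standardize(rare_lat)

# Before standardization the frequent LAT has a far larger spread, so it
# dominates any distance between plays; afterwards both spreads are equal.
print(statistics.stdev(frequent_lat), statistics.stdev(rare_lat))
print(statistics.stdev(z_frequent), statistics.stdev(z_rare))
```

With non-standardized data, heavily used categories carry most of the weight in the clustering; with standardized data, a rarely used LAT counts as much as a common one. Which of the two is preferable for texts is exactly the open question here.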


Shakespearean Dendrograms

PCA Scatterplot in R of the First Folio Plays
Dendrogram produced in JMP on PC1 and PC2 using covariance matrix and Ward’s

Pairing sequence produced in JMP on PC1 and PC2 using covariance matrix and Ward’s

There is another way to visualize the degrees of similarity or dissimilarity among the items we’ve been analyzing in Shakespeare’s Folio plays. A dendrogram, which looks like a family tree, is a two dimensional representation of similarities derived from a statistical technique known as hierarchical clustering. I will say something about this technique in a moment, but first, a review of what we have here.

At the top of this post is the scatterplot we have been working with throughout our analysis of the full texts of the Folio plays. This graph plots the principal components (1 and 2) derived from an analysis of the plays in R using the command “prcomp” (where we have centered but not scaled the data). This analysis took place at the Cluster level (Docuscope can group its most basic token types, the LATs, into seventeen meta-types called Clusters), which we choose because we want fewer variables than observations. Thus, because we have 36 plays in the Folio, we choose to group the items we have counted into seventeen buckets or containers, the Clusters. In previous posts, we tried to explain how the tokens collected in these clusters explain the filiations among the plays that have been discovered by unsupervised methods of statistical analysis, that is, methods requiring no prior knowledge of which genres the plays are “supposed” to fit into.
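For readers who want to see what “centered but not scaled” amounts to, here is a minimal sketch of a first principal component computed from the covariance matrix. It is not the R code used for the plot: the data are invented, the helper names are hypothetical, and power iteration stands in for the decomposition that prcomp performs.

```python
import math


def center_columns(rows):
    """Subtract each column's mean (centering without scaling)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of already-centered rows."""
    n = len(centered)
    p = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=500):
    """Leading eigenvector of the covariance matrix by power iteration."""
    v = [1.0] * len(cov)  # assumed not orthogonal to the leading eigenvector
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical percentage scores for five texts on three Clusters.
data = [
    [2.0, 8.0, 1.0],
    [3.0, 7.0, 1.2],
    [5.0, 5.0, 0.9],
    [7.0, 3.0, 1.1],
    [8.0, 2.0, 1.0],
]
centered = center_columns(data)
pc1 = first_component(covariance_matrix(centered))
pc1_scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
print([round(w, 3) for w in pc1])
print([round(s, 3) for s in pc1_scores])
```

In this toy data the first two Clusters trade off against each other, so they receive loadings of opposite sign, while the third barely varies and gets a loading near zero. Because nothing is rescaled, variables with more raw variation dominate the component. The overall sign of a component is arbitrary.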

There are all sorts of subtle trends that can be gleaned with the aid of statistics, and we will be exploring some of them in the future. I have begun with trends that are visible without much fancy footwork. Looking at unscaled findings of the first two principal components is a fairly vanilla procedure, statistically speaking, which means that its faults and virtues are well-known. It is a first glance at the nature of linked variation in the corpus. And when we look at this plot from above, we see the characteristic opposition of history and comedy, which employ different and, on the whole, opposed linguistic and rhetorical strategies in telling their stories on stage. But is there a way to quantify the degree to which Shakespeare’s plays are like or unlike one another when they are rated on these principal components?

The second and third illustrations provide this information. A dendrogram is a visual representation of a statistical process of agglomeration, the process of finding items that are closely related and then pairing them with other items that are also closely related. There are a number of different techniques for performing the agglomeration (they are variations on the beginning of a square dance, where the people who want to dance with each other most pair up first, and the foot draggers are added to the mix as the dance continues), but the one I have used here is Ward’s minimum variance method. In the work I have done so far with the plays, I have found that Ward’s produces groupings most consonant with the genres of Shakespeare’s plays as we understand them critically. In this dendrogram, the different genres of Shakespeare’s Folio plays are color coded: comedies are red, histories are green, tragedies are brown and late plays are blue. The third item, a table, shows the sequence in which items were paired, listing pairs in order from the most similar to the least.
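The square dance can be sketched in code. What follows is a naive illustration of Ward's criterion, not JMP's implementation: at every step it joins the two groups whose merger adds the least to the within-group sum of squares. The play labels and component scores are invented, and the merge costs it prints need not be on the same scale as the distances a statistics package reports.

```python
def ward_pairings(points):
    """Agglomerate items with Ward's minimum variance criterion.

    At each step, merge the two clusters whose union gives the smallest
    increase in within-cluster sum of squares:
        (n_a * n_b / (n_a + n_b)) * squared distance between centroids.
    Returns the merges in the order they were made.
    """
    # Each cluster is (member labels, centroid, size).
    clusters = [([label], list(vec), 1) for label, vec in points.items()]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                _, ca, na = clusters[i]
                _, cb, nb = clusters[j]
                sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = na * nb / (na + nb) * sq
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        la, ca, na = clusters[i]
        lb, cb, nb = clusters[j]
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        merges.append((la, lb, cost))
        # Remove j first (j > i) so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append((la + lb, centroid, na + nb))
    return merges


# Hypothetical scores of five plays on two principal components.
plays = {
    "Comedy 1": [2.0, 1.8],
    "Comedy 2": [2.2, 2.0],
    "History 1": [-2.0, -1.5],
    "History 2": [-2.1, -1.9],
    "Tragedy 1": [0.2, -0.3],
}
merges = ward_pairings(plays)
for left, right, cost in merges:
    print(left, "+", right, round(cost, 3))
```

The printed list plays the role of the pairing table: the eager dancers (the two invented comedies) pair first, and the last line records the final, most reluctant join between the remaining groups.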

We learn a couple of things from this analysis. First, using the first two principal components was a good but not terrific way of grouping the plays into genres. If we were to be bowled over by this analysis, we would expect to see less intermixing of different colored items within the clusters and subclusters of the dendrogram. Using more than two components will provide better accuracy, as we will see below. Second, there is a reasonably intuitive path to be travelled from the patterns in the dendrogram to some explanation at the level of the language in the plays. I can, for example, ask why Love’s Labour’s Lost and King John look so similar in this analysis, and find an answer in the components themselves. This is what we did in the previous post, where I looked at a passage that expressed the first and second components in a typical, history play sort of way. Because the scatterplot is a graph and not a map, we need to interpret proximity of items correctly: items that collocate with one another possess and lack the two components being graphed in the same way. The same is true of The Tempest and Romeo and Juliet, but in this case, these plays possess a high degree of both principal component 1 and principal component 2: they combine lots of description with lots of first person and interaction.

Now look at the tradeoff we make when we take advantage of all the components that can be extracted using PCA. Using all seventeen components, we get the following dendrogram, again using Ward’s minimum variance method. (In JMP 8, I am performing Ward’s procedure without standardization on the PCs derived using the covariance matrix, the latter components being identical, as far as I can tell, to those I would have derived in R using prcomp with center = T and scale. = F.) The first dendrogram is color coded by genre, like the one above. The second color codes the plays from lighter to darker, the lighter ones being those composed earlier in Shakespeare’s career (according to the Oxford editors) while the darker ones are from later.

Dendrogram produced in JMP using all principal components, clustered with Ward

Dendrogram produced in JMP using all principal components, color coded according to time of composition (Oxford order)

Use of all the components provides much better separation into intelligible genres: notice how well the histories are clustering at the top of the dendrogram, while we find nice clusters of both tragedies and comedies below. And if we look at the two largest clusters — those linked by the right-most line in the dendrogram — we see that they are broadly separated into earlier and later plays (with the earlier on top, later below).

Nice result, and we’ll see some even nicer ones produced with different techniques in subsequent posts. But how do you work back from these images to a concrete discussion of what puts the plays together? Because we are dealing with seventeen components simultaneously, it is nearly impossible to make this properly interpretive leap. This is the fundamental paradox you encounter when you apply statistics to categorical data derived from texts: the more intensively you make use of mathematics in order to render intelligible groups, the less you know about what qualifies items for membership in those groups. If we were running a search engine company like Google, we wouldn’t worry about explaining why items are grouped together, since our business would only concern producing groupings that make sense to an “end user.” But we do not work for Google, at least not yet. There will be times when it makes more sense to visualize and explore manageable, critically intelligible levels of complexity instead of seeking the “perfect model” of literary experience.


Local Versus Diffused Variation; the Hinman Collator

Hinman Controls

Above are two images of the Hinman Collator currently residing in the Memorial Library at the University of Wisconsin, another optical collating device that uses visual comparisons to highlight minute differences between seemingly identical versions of the same text. Hinman used the device in his landmark survey of Shakespeare’s First Folio; it allows the user (here, paper conservator Theresa Smith of Harvard) to “see” differences between two items by merging them in a single image. Areas of difference appear as a kind of grey area, a more subtle effect, perhaps, than the hovering text that is produced by the Lindstrand Comparator discussed in the previous post. The device also allows you to toggle between the two editions you are looking at, making subtle differences stand out immediately. Prior to creating this device, Hinman had worked in military intelligence comparing pre- and post-bombing aerial photographs: the collator is thus one of several adaptations of military skills and technologies for literary analysis.

Smith and her collaborator, Daniel Selcer (Duquesne), gave a fascinating paper at Wisconsin two weeks ago which dealt with differences among facsimile editions of Copernicus’ De revolutionibus. Here Smith is examining two facsimiles of De revolutionibus with divergent diagrams of the center of Copernicus’ universe (the sun), which in one edition appears as a circular outline and in another as a solid sphere. The Hinman revealed the differences in the center of the diagram immediately, but it also revealed other areas of discoloration and streaking which suggested that the underlying manuscript (closely guarded in Poland) might not be adequately represented by the facsimile edition. One of the interesting points of Smith and Selcer’s paper was that you must treat facsimiles as artifacts in their own right; they are not always indexical transcriptions of an original, but in certain crucial ways iconic: the technologies used in producing the facsimile proper introduce artifactual effects that make the facsimile a likeness rather than an absolute trace-copy of the original.

This device and the one I wrote about in the previous post are technologies for the identification of local variants, which is to say, variants that occur in one place on the page. The search for such variants has been crucial in the history of textual scholarship. Hinman, for example, was able to deduce from variants among surviving First Folios the order in which the forms were printed and their various states of correction. This, in turn, led him to reconstruct an “ideal” Folio which he hypothesized contained the latest or most corrected state of the book. (No single Folio contained all of the corrections, since as Hinman argued, “every copy of the finished book shows a mixture of early and late states of the text that is peculiar to it alone.”) While I was presenting some findings from the work I have been doing with Docuscope at Loyola earlier this month, Peter Shillingsburg made a very important point in connection with this idea: when you are interested in broad patterns, it appears that it does not matter what edition you are working with. This may not seem like a big deal to some readers, but for Shakespeareans, it matters quite a lot which edition of the plays you are using to argue your case. There are substantive differences between different printed editions of the plays, and in some cases, individual words or phrases (for example, Hamlet’s “dram of eale”) can be emended to produce significantly different readings of a particular passage.

So, would the findings I come up with using Docuscope change significantly if I switched from the Moby Shakespeare (a nineteenth-century edition produced at Cambridge) to the Oxford or Norton? Yes and no. No in the sense that I am interested in a form of variation that is not accounted for in the kind of textual scholarship practiced by Hinman. In looking for genre at the level of the sentence, I am looking for diffused rather than local variation: a kind of patterned deviation from the mean that occurs across the entire body of the text rather than at one crucial intersection. So if Docuscope were able suddenly to read the word “evil” once an editor had emended “eale” in Hamlet’s speech, there would be a slight uptick in one of its counting categories (“Negative Values”). But that uptick would probably not fundamentally alter the patterns being discriminated across the entire corpus of Shakespeare’s works. Yes in the sense that there are some statistical procedures which could register slight upticks in infrequently used categories and correlate them with others that are exercised all the time. What if there is a correlation between even slight changes in Shakespeare’s use of “Negative Values” tokens and the much more common “Description” tokens we explored in the histories? What if, in other words, a “dash of x” matters sometimes?

I think it is important to recognize this category of the “dash” or “pinch” in looking for broader patterns of variation in large populations of texts, because it sits somewhere between the “crux” local variants like “eale” and the global variation we see in uses of “the” or concrete descriptive nouns. Because Docuscope is looking at things sub specie aeternitatis, as it were, we cannot say that it matters when such dashwords are used. Time sensitivity in use, immediate context: these are all crucial features that help us understand local variants. (And we are quite attracted to local variants, as the history of literary criticism and close reading shows.) Dashwords are different: rare, like an eclipse, but nevertheless part of a globally diffused pattern.

Posted in Shakespeare

Pre-Digital Iteration: The Lindstrand Comparator

I’ve just finished a terrific conference at Loyola organized by Suzanne Gossett on “The Future of Shakespeare’s Text(s).” This photo shows a device, used by one of the conference organizers, Peter Shillingsburg, to perform manual collation of printed editions of texts. There is a long tradition of using optical collators to find and identify differences in printed editions of texts; this one, the Lindstrand Comparator, works on a deviously clever principle. Exploiting a feature of the human visual system, the Lindstrand Comparator allows you to view two different pages simultaneously, with each image being fed to only one eye at a time through a series of mirrors. When the brain tries to reconcile the two disparate images (a divergence caused by print differences on the page), these textual differences will levitate on the surface of the page or, conversely, sink into it. What is in actuality a spatial disparity becomes a disparity of depth via contributions of the brain (which is clearly an integral part of the apparatus).

In this photo, Shakespeare scholar Gabriel Egan compares two variant printed editions of a modern novel. The device is an excellent example of mechanical-optical technology being used to assist in the iterative tasks of scholarship — iterations we now perform with computers. It is also the only technology I know of that lets you see depth in a page, something you cannot do with hypertexts or e-readers. Maybe we should stop writing code for fancy web-pages and start working with wood and mirrors?

Posted in Counting Other Things

Edward III, Shakespearean Trigrams, and Trillin’s Derivatives

A bustling day for Shakespeare scholars, and those who follow computer-assisted work in the humanities. The Times reported yesterday that Sir Brian Vickers has used a plagiarism detection software package to demonstrate that Shakespeare wrote Edward III with Thomas Kyd. Edward III was published anonymously in 1596 and has, since the eighteenth century, been associated with Shakespeare on the basis of stylistic and stylometric analyses. I haven’t seen any substantive writeup of Vickers’ conclusions, so I don’t want to pass judgment on the results. But the description we have of his proof so far is intriguing. Apparently the program, developed at the law school at Maastricht, produces a concordance of all three-word sequences (trigrams) in a target text and then looks for these trigrams in other documents. According to Vickers, some of these trigrams are quite common and so shared by many language users. They represent common grammatical runs. But other trigrams are unique to a single writer and so are, in his words, a type of “fingerprint.” These author-generated trigrams (as opposed to those generated by grammatical constraints) tend to be “metaphors or unusual parts of speech.” Edward III, he claims, is seen to contain the author-fingerprints of both Shakespeare and Thomas Kyd when the play is compared on these terms with other writings by these authors published before 1596. The article goes on to explain that it’s usually around 20 grammar-based trigrams that are shared among texts (the common ones), but that 200 unique Shakespearean trigrams appeared in Edward III (in about 40% of the scenes), and “around 200” unique Kydian trigrams appeared in the rest of the play.
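
The matching procedure described above can be sketched in a few lines of Python. This is only a rough illustration of the idea as reported, not the Maastricht program itself: the function names (trigrams, shared_trigrams), the crude tokenizer, and the sample phrases are all assumptions of mine.

```python
import re

def trigrams(text):
    """Return the set of three-word sequences in a text (hypothetical helper)."""
    # Crude tokenizer (an assumption): lowercase, keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def shared_trigrams(target, reference):
    """Trigrams from the target text that also occur in the reference text."""
    return trigrams(target) & trigrams(reference)

# Invented sample phrases, for illustration only.
target = "the king of the north rides out at dawn"
reference = "a messenger for the king of the north arrived"
print(sorted(shared_trigrams(target, reference)))
```

Note that nothing in such a procedure distinguishes a “grammatical” run from an “authorial” one; it only reports which sequences recur.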

This analysis prompted two thoughts. First, I was skeptical, since my work with Docuscope — which uses a mix of grammatical and semantic categories to class populations of texts — has not indicated that the program has any ability to “see” authorship in the texts it analyses. This may be due to the nature of Docuscope: it was designed to track genres, so its phenomenological categories (which span the range of semantic, grammatical and rhetorical effects) are not particularly good for fingerprinting. But let’s say that Vickers is right, and that something like authorship can be described in terms of unique three-word sequences. Exactly what kind of linguistic independence — and surely some independence from grammatical and genre constraints is presumed here — do authors possess? According to this study, it would be the ability to sequence short runs of language in a unique way, an ability that is confirmed by the fact that those three-word sequences which are duplicated in the works of other writers are few and merely grammatical. My first question, then, is this: since the machine itself is not tasked with sifting what is merely grammatical from what is metaphoric and so authorial (and remember, some of the unique passages are “unusual parts of speech”), on what basis do we discount certain shared runs as merely grammatical and therefore non-authorial?

Now I know the obvious reason why Vickers sees authorship at work here: there are only 10 to 20 trigrams shared between (a) the known pre-1596 works of Shakespeare, (b) the target text (Edward III), and (c) the known pre-1596 works of Kyd, whereas there are 200 that belong to Shakespeare and certain sub-portions of the target text exclusively (a and b). The comparatively high ratio of exclusive to commonly shared trigrams (200/20) suggests Shakespeare is the author of those portions of the play with the abundance of non-Kydian trigrams. But it is worth thinking about the difference between the two types of shared trigrams, since the (interpretive) characterization of this difference supports an underlying theory of authorship which is itself worth debating. In the world of contemporary literary criticism, the author can now just as easily be described as an institution, effect or persona as he or she can as an empirical person. So do we now have empirical proof that the author as person (rather than institution, persona or effect) shows up on the page in the form of unique combinations of metaphorical words? Sounds like the old, romantic fashioner of images to me, although we have no comparative data on the number of times particular writers use unique trigrams and so exercise their combinatory, poetic imagination. Is Shakespeare more inventive (and so authorial) than say Kyd, but less than Jonson? We’ll have to wait and see.
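
The three-way comparison in this reasoning amounts to a few set operations. Here is a minimal sketch in Python, assuming each body of work has already been reduced to a set of trigrams; the function name partition_trigrams and the toy data are hypothetical and are not taken from Vickers’ study.

```python
def partition_trigrams(shakespeare, target, kyd):
    """Split the target's matches into common and author-exclusive groups.

    Each argument is a set of trigrams (plain strings here, for brevity).
    """
    common = shakespeare & target & kyd              # found in (a), (b) and (c)
    shakespeare_only = (shakespeare & target) - kyd  # found in (a) and (b) only
    kyd_only = (kyd & target) - shakespeare          # found in (c) and (b) only
    return common, shakespeare_only, kyd_only

# Toy sets, invented for illustration only.
a = {"of the king", "my gracious lord", "silver moonlit river"}
b = {"of the king", "silver moonlit river", "iron winter crown", "ashes of noon"}
c = {"of the king", "iron winter crown", "ashes of noon"}

common, shakespeare_only, kyd_only = partition_trigrams(a, b, c)
print(len(common), len(shakespeare_only), len(kyd_only))
```

On the reported figures, the first group would hold roughly 10 to 20 items and the second about 200. What justifies calling the first group grammatical and the second authorial is an interpretive matter that no set operation settles.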

My second thought is prompted by an editorial by Calvin Trillin in the NYTimes this morning. Trillin tells a story about a man he met in a bar in Manhattan who explained the financial meltdown to him. The man, a late middle-aged Ivy Leaguer who has done well for himself in life, says that Wall Street melted down because “the smart guys had started working on Wall Street.” (I stumbled over the pluperfect in this sentence, but as a Trillin fan, I can only study and learn.) He tells Trillin that in the old days, the really smart kids went on to prestige jobs as judges and professors, where their minds were exercised but they didn’t make a lot of money. The lower third of the class, whose academic performance was undistinguished, went on to careers in finance, where they made oodles of money and could afford homes in Greenwich. As the cost of an elite education grew, however, the really smart ones decided they ought to go make a pile of money before doing something more rewarding, paying off their college loans on their way out the door. “That’s when you started reading about these geniuses from M.I.T. and Caltech who instead of going to graduate school in physics went to Wall Street to calculate arbitrage odds,” the man at the bar says. The problem, he goes on to explain, is that these high-flyers were quants or math geniuses and, in addition to knowing how to manipulate a data set, realized fairly quickly that they could make even more money than the “lower third” had been doing in their Wall Street careers. It didn’t take long for the quants to invent derivatives and credit default swaps: from here on out, risk would be “quantified” in ways that no one could really understand, and Wall Street executives would now be free to pursue profits and risks with reckless abandon.

I have often wondered if the application of statistics to the study of literature is a bit like the creation of literary “derivatives.” There are millions and millions of patterns in texts (perhaps as many as there are neurons in the brain) which could be grouped into bundles and used to class texts according to this or that taxonomy. (“I’d like to bundle Shakespeare and Jonson futures today, please.”) The pitfalls in this kind of analysis are the same as those facing Wall Street investors who look for new ways to describe (and then commoditize) the phenomena they trade in. You can attach meaning to patterns in the data, but there is no guarantee that the pattern you are looking at represents something that you really understand. That is why I have tried to work with known quantities (the decisions of these people to call these texts comedies, for example) and then link them to mechanisms or rhetorical effects that I recognize (retrospective narration, two-person dialogue, seduction). I say this because there are a lot of ways, millions of them, to measure similarities between texts. This recent attribution of co-authorship by Vickers is ultimately based on a measure of similarity across passages in known and unknown works. But one always has to ask, similar in what respect?

Do we know that individuals use three-word sequences in ways that are so unique that they cannot be imitated, thus ensuring that patterns among these sequences are an unconscious signature of authorship? A similar theory has been advanced about painters, suggesting that individual artists can be identified by acts that they do not consciously attend to, such as the sketching of ears and hands. Carlo Ginzburg has written interestingly about this, and there is a fascinating discussion of a similar technique (the “courtesan method”) in Orhan Pamuk’s novel, My Name Is Red. But of course, there might be other ways of measuring similarities and differences (other things to count) that would not pick out these two authors as decisively. My point is that the presumption that these are the things that need to be counted (trigrams), as opposed to the wealth of other countable things, is based on interpretation rather than observation. Such a choice implies a theory of authorial behavior that should itself be tested over a broader range of texts, suggesting as it does that authorship can be “conclusively” measured and assessed as a behavior (i.e., it is using language in unique ways) rather than as a feeling in the reader. And it implies a distinction between discountable overlaps in such behaviors (because they are grammatical) and significant ones (because they are poetic or metaphoric) that should itself be explored further. For contrast, I would point out the work of Matt Jockers at Stanford, who has shown that combinations of extremely common words can provide clues about an author’s identity. Vickers’ author-tracking trigrams are unique, and it is their uniqueness that gives them value, whereas Jockers’ “most frequent word” analyses (of words such as I, me, of, the, etc.) assume that it is the common things that really “take” the imprint of the author’s individual nature.

All of this is good, I think, for the statistical study of literature, and for the study of literature more generally, since it forces us to ask basic questions about what authors do. Who knows, maybe Vickers is right. The more interesting question is, “Why?”

Posted in Shakespeare

Rhythm Quants: Burial, Click Tracks, Genre Tempo

Graham has posted a new video by one of my favorite artists over at Object Oriented Philosophy. Burial is a London DJ whose work often gets filed under the label “dubstep,” a variety of post-house electronica that appeared several years ago. I like dubstep a lot, and this video actually captures something of its unsteady, city-worn appeal:

One of the greatest things about Burial is that his beats are asymmetrical. That is, in a world where you can loop beats in such a way that the “ictus” (ideal musical point where the beat falls) is evenly distributed across the entire snippet, Burial’s beats sway a bit from tempo and then rejoin when the loop starts over again. I tend to hear this because I am a drummer, and was trained to play in the 1980s, just when drum machines were becoming more common in live performance and studio recording. Those of us who learned to play in this period were forced to synch our bodies (and eventually, minds) to a mathematically precise representation of the ictus, one that is produced by a machine, so that our own playing would match up with that of others who were similarly keyed into this “reference beat.” Most often, that reference beat would be calling the changes in synthesizer parts (which were electronically triggered by that reference): so the whole band, or the band in the recording studio, would ideally be vibrating to the same periodic oscillation, one that never changed unless the beat frequency was altered by the programmer or producer.

But of course, drumming is more fluid than this kind of matching to the mathematical ictus. Most dance music (music that people actually dance to) has subtle movements ahead of and behind the beat. This occurs in part to create musical tension, but also to whip dancers around in the right way. (Our bodies may exhibit symmetry, but our dance steps do not.) The most extreme versions of this kind of dance-wobble that I have witnessed, although not directly related to drumming, occur in European music. Hearing an orchestra play Strauss in Vienna, I was initiated into something that the Viennese take for granted: Strauss rushes the 1-2 in the 1-2-3 of waltz tempo, which means that you get a one-two…three, one-two….three in which the second beat does not evenly divide the first from the third. Hungarian and Romanian folk music have some of this as well. I remember being at a dancehouse in Budapest in the eighties, hearing a Roma folk band play, and being amazed at the quick surges and retards in the tempo, occurring at every measure. This variation, I was told, helped the dancers whip each other around so that their bodies could lean at the appropriate moment: a really beautiful idea, since it suggests that the music itself was conforming to the movements and weightings of the dance, even at the level of tempo.

If you look at the beginning of the Burial video, you can see the idea of symmetry taken apart on the screen, as the diagonals display action in a kind of dance-box. Movements and pans in and out of the paired boxes do not occur at the same speed, which means that you get the same kind of staggered synchrony that often occurs in Burial’s musical beats, but here it occurs visually.

I suspect that a good studio engineer could actually quantify the ways in which Burial’s beats redistribute the ictus on a measure by measure basis, something that was once done by drummers who were not playing to a “click” or mechanically measured metronome, but perhaps more intuitively and communally. That’s not to say that Burial has recaptured the “fluid” nature of the beat or that the electronic metronome killed the beat (and that Burial is bringing it back). It’s not that simple. Rather, drummers have always had a good sense of what the “ictus” is and have manipulated it implicitly by speeding up and slowing down before the beginnings and endings of measures. In a pre-click track world (listen, for example, to some of the beats by The Meters) you wouldn’t necessarily notice the manipulations, because the world had not yet learned to “hear” the absent click, which happens once music everywhere is keyed to an inaudible metrical yardstick. I would say that this was the case by the early nineties. But once this implicit beat becomes part of the music (part of the bodies and ears of drummers and listeners alike), the tempo pushes and pulls are audible as deliberate. The drummers Manu Katché and Omar Hakim have made an art form out of this over the last two decades. I’m sure both of them can play to a click track (or not) in their sleep.

The point here is that human beings are exquisitely sensitive to quantitative phenomena like rhythm, and they can also have their background perceptions of what “proper rhythm” is shaped by the music they encounter. There is a backbeat or hidden track to music that is cultural, but that is confirmed or shifted with each performance. I suspect genre works in the same way — as a set of constantly shaped expectations — and that in some cases tempo has been keyed to certain arbitrary or regular standards in order to create particular effects. Serialization might be one version of this (something my colleague Susan Bernstein and I are working on), or the partitioning of plot around commercials.

Posted in Counting Other Things

Keeping the Game in Your Head: David Ortiz

I’m not a huge baseball fan, but I did grow up in the suburbs of Boston and so like the Red Sox. Over the weekend I saw a story in the Times about David Ortiz, who went from being a fabulous home run hitter to someone who couldn’t really connect with the ball and so lost his place at the top of the Red Sox batting order. Baseball is now loaded with information, as anyone who has followed the career of Nate Silver will be aware. (Silver established his reputation as a baseball statistician but then went on to predict congressional and presidential elections at fivethirtyeight.com.) Apparently Ortiz was drawn into the game of studying his own performance “by the numbers,” and eventually it got to his game. Only when he decided to play for the “fun” of it did his hitting power return. As a story about a player’s encounter with statistics, this one has four parts: talented hitter does well; talented hitter attempts to improve performance with statistics (reported in the Times here); talented hitter suffers from overthinking his game; talented hitter learns to play the game again by forgetting about the numbers.

Perhaps this story is useful for thinking about the nature of statistically assisted reading. I’m not saying that using statistics to explore textual patterns drains the joy out of reading: it doesn’t, because the statistical re-description of texts is not reading in the sense that you or I would practice it. But I have had interesting experiences reading texts after I have learned something about the underlying linguistic patterns that they express. For example, when I learned that Shakespeare’s late plays contain a linguistic structure in the form of “, which” [comma, which] that distinguished them from all other Shakespeare plays, I really started to pay attention to these in my reading. I wouldn’t say that this detracted from my ability to read the text; rather it drew my attention to something else that was going on. But I also noticed that it was nearly impossible to pay attention to the linguistic patterns and to experience the meaning of that pattern at the same time. That is, I could either notice linguistic features of a play (presence of pronouns, concrete nouns, verbs in past tense, etc.) and ask why they were being used in a particular scene, or I could float along with the spoken line, feeling different ideas or emotions eddy and build as the speaker developed an image or theme. But I couldn’t do both.

Why should there be this “Ortiz effect” in reading? Is there some kind of fundamental scarcity of attention that forbids one’s reading as a (statistically assisted) linguist and as “any reader whatever” at the same time? I’m interested in this division, but skeptical of the idea (advanced in the article about Ortiz’ return to greatness) that you can forget what you know and “just do it.” The Times article says that Ortiz became a better hitter when he learned simply to “play…as if he were a boy.” But reading is never this simple: you can’t completely forget what you know, even if you learned it through the apparently foreign procedures of statistical analysis. Perhaps you can read “as if” you didn’t know it, and then re-engage that knowledge to examine how the linguistic patterns produce the effects you’ve just experienced? My point here is that readers who are assisted by statistics must simultaneously be both versions of Ortiz described in the different articles: both the hitter and the thinker. It would be a mistake to think that “natural” reading is accomplished in a state of child-like absorption in the game, since even children are brimming with strategies and inferences. I am glad to know certain things about Shakespeare that I couldn’t have known without the assistance of statistics, like the fact that the Histories are full of concrete description and short on first- and second-person pronouns. This doesn’t interfere with my game (I hope), but shows me that the game can be played on another, as yet unknown, verbal plane.

Posted in Counting Other Things, Shakespeare

Four-Syllable Rock n’ Roll

Certain things can be counted without a parsing device, for example four-syllable words in rock n’ roll songs. I have often wondered why there are so many one-syllable words in rock songs, and have a pet theory for this. Rock lyrics favor Anglo-Saxon words rather than Latinate words (the former have a more direct, less fussy sound), and since the Latinate words tend to be multi-syllabic compounds, multi-syllabic words (say, more than three syllables) tend to be very rare in rock music. Why exactly the monosyllable is appropriate to rock is something I cannot explain, although it may be related to another pattern I have observed: countries that underwent the Protestant Reformation seem to be the most adept at producing (not necessarily consuming) rock music, particularly heavy metal. Perhaps there is a connection here between Northern European linguistic practices (and the persistence of Anglo-Saxon forms) and the predisposition to religious violence in the sixteenth and seventeenth centuries, one that prepares these countries for immersion in a subsequent musical form like rock n’ roll.

In any event, I’d like to know what the longest Latinate word is that has been successfully used in a rock song. My candidate (based on popularity, not length) would be “satisfaction,” as in, “I can’t get no satisfaction.”

Posted in Counting Other Things