Sunday, July 31, 2011

Predictable writing?

I have heard of this before. It is curious in that students' papers could be analyzed to make sure they are authentic to the student's style. Big Brother is closer than you think.

"The Jargon of the Novel, Computed"


Ben Zimmer

July 29th, 2011

The New York Times

We like to think that modern fiction, particularly American fiction, is free from the artificial stylistic pretensions of the past. Richard Bridgman expressed a common view in his 1966 book “The Colloquial Style in America.” “Whereas in the 19th century a very real distinction could be made between the vernacular and standard diction as they were used in prose,” Bridgman wrote, “in the 20th century the vernacular had virtually become standard.” Thanks to such pioneers as Mark Twain, Stephen Crane, Gertrude Stein and Ernest Hemingway, the story goes, ornate classicism was replaced by a straight-talking vox populi.

Now in the 21st century, with sophisticated text-crunching tools at our disposal, it is possible to put Bridgman’s theory to the test. Has a vernacular style become the standard for the typical fiction writer? Or is literary language still a distinct and peculiar beast?

Scholars in the growing field of digital humanities can tackle this question by analyzing enormous numbers of texts at once. When books and other written documents are gathered into an electronic corpus, one “subcorpus” can be compared with another: all the digitized fiction, for instance, can be stacked up against other genres of writing, like news reports, academic papers or blog posts.

One such research enterprise is the Corpus of Contemporary American English, or COCA, which brings together 425 million words of text from the past two decades, with equally large samples drawn from fiction, popular magazines, newspapers, academic texts and transcripts of spoken English. The fiction samples cover short stories and plays in literary magazines, along with the first chapters of hundreds of novels from major publishers. The compiler of COCA, Mark Davies at Brigham Young University, has designed a freely available online interface that can respond to queries about how contemporary language is used. Even grammatical questions are fair game, since every word in the corpus has been tagged with a part of speech.
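The grammatical queries the article describes depend on every token carrying a part-of-speech tag. A minimal sketch of the idea, using a tiny invented hand-tagged sample (COCA itself is searched through its web interface, not through code like this):

```python
from collections import Counter

# Each token in a tagged corpus is a (word, POS) pair. This miniature
# "corpus" is invented purely for illustration.
tagged = [
    ("she", "PRON"), ("said", "VBD"), ("nothing", "NOUN"),
    ("he", "PRON"), ("grimaced", "VBD"), ("and", "CONJ"),
    ("looked", "VBD"), ("away", "ADV"),
]

# With tags attached, "which past-tense verbs occur?" reduces to
# filtering on the tag (VBD = past-tense verb) and counting.
past_tense = Counter(word for word, pos in tagged if pos == "VBD")
print(past_tense.most_common())
```

The same filter-then-count pattern scales from eight tokens to 425 million; the tagging is what turns a grammatical question into a mechanical query.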

Suppose we’re interested in looking at past-tense verbs. The most common examples in COCA are nondescript: “said,” “came,” “got,” “went,” “made,” “took” and so on. On the surface, the fiction offerings aren’t that different: “said” is still the big winner, while some others move up the list a few spots, like “looked,” “knew” and “thought.” But ask COCA which past-tense verbs show up more frequently in fiction compared with, say, academic prose, and things start to get interesting: the top five are “grimaced,” “scowled,” “grunted,” “wiggled” and “gritted.” Sour facial expressions, gruff noises and emphatic bodily movements (wiggling fingers and gritting teeth) would seem to rule the verbs peculiar to today’s published fiction.
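The "more frequent in fiction than in academic prose" comparison above amounts to ranking words by the ratio of their per-million rates in the two subcorpora. A hedged sketch with invented counts (COCA's real figures differ, and its interface handles the arithmetic for you); the smoothing constant keeps words absent from one subcorpus from dividing by zero:

```python
def relative_ratio(fiction_counts, academic_counts,
                   fiction_total, academic_total, smoothing=0.5):
    """Rank words by per-million rate in fiction divided by the
    per-million rate in academic prose (higher = more fiction-flavored)."""
    words = set(fiction_counts) | set(academic_counts)
    ratios = {}
    for w in words:
        f = (fiction_counts.get(w, 0) + smoothing) / fiction_total * 1e6
        a = (academic_counts.get(w, 0) + smoothing) / academic_total * 1e6
        ratios[w] = f / a
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Invented counts out of one million tokens per subcorpus.
fiction = {"said": 9000, "grimaced": 120, "scowled": 95}
academic = {"said": 4000, "grimaced": 1, "scowled": 0}
ranked = relative_ratio(fiction, academic, 1_000_000, 1_000_000)
print(ranked)
```

Note how "said," despite being the most frequent verb outright, ranks last here: it is common everywhere, so its ratio hovers near parity, while "scowled" and "grimaced" surface as distinctively fictional.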

Beyond the use of individual words, researchers can uncover even more striking patterns by looking at how words combine with their neighbors, forming “collocations.” Dictionary makers take a special interest in high-frequency collocations, since they can be the key to understanding how words work in the world. It’s a particular boon for making dictionaries that appeal to learners of English as a second language. When the lexicographer Orin Hargraves was studying collocations for a project at Oxford University Press (where I previously worked as editor for American dictionaries), he struck upon a trove of collocations that “would not be statistically significant were it not for their appearance in fiction.” And these weren’t just artifacts of genre fiction, like “warp speed” in sci-fi or “fiery passion” in bodice-ripping romance novels.

Using the Oxford English Corpus, encompassing about two billion words of 21st-century English, Hargraves found peculiar patterns in simple words like the verb “brush.” Everybody talks about brushing their teeth, but other possible companions, like “hair,” “strand,” “lock” and “lip,” appear up to 150 times more frequently in fiction than in any other genre. “Brush” appears near “lips” when two characters’ lips brush against each other or one’s lips brush against another’s cheek — as happens so often in novels. For the hair-related collocations, Hargraves concludes that “fictional characters cannot stop playing with their hair.”
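Collocation hunting of the kind Hargraves describes boils down to counting how often a node word ("brush") has a given companion ("lips") within a few tokens of it, then comparing those counts across genres. A toy sketch with invented sentences, not Oxford English Corpus data:

```python
def collocation_count(tokens, node, collocate, window=4):
    """Count occurrences of `node` with `collocate` within `window`
    tokens on either side."""
    hits = 0
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), i + window + 1
            if collocate in tokens[lo:i] + tokens[i + 1:hi]:
                hits += 1
    return hits

fiction = "her lips brush against his cheek as they part".split()
news = "crews brush snow from the runway before dawn".split()
print(collocation_count(fiction, "brush", "lips"))  # 1
print(collocation_count(news, "brush", "lips"))     # 0
```

Run over billions of words per genre, counts like these are what let a lexicographer say a pairing appears "up to 150 times more frequently in fiction than in any other genre."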

“Bolting upright” and “drawing one’s breath” are two more fiction-specific turns of phrase revealed by the corpus. Creative writers are clearly drawn to descriptive idioms that allow their characters to register emotional responses through telling bits of physical action — “business,” as they say in theater. The conventions of modern storytelling dictate that fictional characters react to their worlds in certain stock ways and that the storytellers use stock expressions to describe those reactions. Readers might not think of such idioms as literary clichés, unless they are particularly egregious. Individual authors will of course have their own idiosyncratic linguistic tics. Dan Brown, of “Da Vinci Code” fame, is partial to eyebrows. In his techno-thriller “Digital Fortress,” characters arch or raise their eyebrows no fewer than 14 times.

Brown’s eyebrow obsession may simply signal a lack of imagination, but corpus research can also illuminate a writer’s stylistic creativity. Masahiro Hori, a professor of English linguistics at Kumamoto Gakuen University in Japan, has studied how Charles Dickens breathed new life into literary collocations. In “The Pickwick Papers,” for instance, Dickens played off the idiom “to look daggers at someone” (meaning to shoot a wrathful glare, itself descended from Shakespeare’s “to speak daggers”) by innovatively replacing “daggers” with “carving-knives”: an old lady “looked carving-knives at the hardheaded delinquent.” To be sure, a careful reader might have discerned the originality of the phrase on his own, but corpus analysis allowed Hori to confirm and extend his insights into Dickens’s originality.

For David Bamman, a senior researcher in computational linguistics with Tufts University’s Perseus Project, analyzing collocations can help unwrap the way a writer “indexes” a literary style by lifting phrases from the past. Often this can consist of conscious allusions — Bamman and his colleagues used computational methods to zero in on the places in “Paradise Lost” where John Milton is alluding to the Latin of Virgil’s “Aeneid.” Though traditional literary scholarship has long sought to track these echoes, the work can now be done automatically, transcending any single analyst’s selective attention. The same methods can also ferret out how intertextuality can work on a more unconscious level, silently directing a writer to select particular word combinations to match the expectations of the appropriate genre.
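At its simplest, hunting for textual echoes of the kind Bamman describes means finding word sequences two texts share. The real Perseus work matches Milton's English against Virgil's Latin with far more sophisticated alignment; this sketch shows only the bare n-gram-overlap idea, on invented English stand-in lines:

```python
def shared_ngrams(a, b, n=3):
    """Return the set of n-word sequences that texts a and b share."""
    def grams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return grams(a) & grams(b)

# Invented source line and invented later "echo" of it.
source = "he fell from the morning sky like a star"
echo = "cast down from the morning sky in ruin"
overlap = shared_ngrams(source, echo)
print(overlap)
```

Because the machine checks every pair of passages exhaustively, it is not limited by any single scholar's memory of the source text, which is exactly the "selective attention" point the article makes.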

When we see a character in contemporary fiction “bolt upright” or “draw a breath,” we join in this silent game, picking up the subtle cues that telegraph a literary style. The game works best when the writer’s idiomatic English does not scream “This is a novel!” but instead provides a kind of comfortable linguistic furniture to settle into as we read a novel or short story. While Twain, Hemingway and the rest of the vernacularizers may have introduced more “natural” or “authentic” styles of writing, literature did not suddenly become unliterary simply because the prose was no longer so high-flying. Rather, the textual hints of literariness continue to wash over us unannounced, even as a new kind of brainpower, the computational kind, can help identify exactly what those hints are and how they function.

[Ben Zimmer is the executive producer of the Visual Thesaurus and the former On Language columnist for The Times Magazine.]

Thanks to POSP stringer Tim.
