Automated Content Authorship


While I have been working on this project for about 16 years now, and publishing for
about 8 years, the news of the patent (September 2007) set off a chain reaction in the
press. Ambiguous or terse writing (probably due to space limitations) has led to a bit
of confusion. I will start with a “Myth” report:



Myth #1 - I am working on romance novels

I have a method for doing so, but find that this area is not a top priority for
automation (given the genre is covered by so many authors already). Of all the forms of
fictional literature (beyond poetry), romance novels may be the most formulaic (or at
least some of the sub-genres) as many titles follow established rules (by époque, level
of explicit language, etc.) and therefore may be reverse engineered and automated.
There are, of course, books dedicated to deciphering formulas within this genre (e.g.
Complete Idiot's Guide to Writing Erotic Romance).
One journalist correctly quotes me (Shellie Karabell):

But he’s not looking to replace authors of fiction - even best selling
fiction. “I would never try to programme a computer that would write
the Harry Potter series,” he says. “The amount of time needed to write
a programme would be longer than the human writing the book.”

I have explained to journalists that such a project would be very time consuming, and,
in my opinion, would not produce titles of much use to anyone (notwithstanding a
potential glut of fiction titles, at least in the English language). My current projects
are reference and educational materials (in numerous languages), web-based educational
materials (my online dictionary) and educational video animations. Romance novels
authored by computers are an R&D project for the future, and may never see the light of
day (at least from me).


Myth #2 - the book is created after someone orders.


In a Guardian article, the author writes “Nothing but the title need actually exist
until somebody orders a copy. At that point, a computer assembles the book's content
and prints up a single copy.” This statement may be misleading. All titles are created
in advance, vetted, and supplied to distribution well in advance of any order. No title
exists on a distributor’s site that does not exist in its entirety first (i.e. no book
is written on demand); financial gap analyses are updated if needed before being sent
to the user, but an original pre-exists. Only the printing in paper format is done upon
request. Some 95% or more of the titles are sold electronically (through channels not
used by the general public, such as OCLC’s NetLibrary, Ingram's MyiLibrary, EBSCO,
marketresearch.com, etc.). PDF versions of each title are sent to distributors months
before they appear on sites. Amazon only recently requested that they carry my business
titles as they expand to business segments. The automated business titles have been
published since the year 2000 and are purchased by central governments, large
multinationals, banks and businesses with imports/exports.

Myth #3 - I consider myself to be the most published author.


Actually, I was first introduced as such by a dean at INSEAD to a visitor. Later a
spokesperson from Amazon.com was quoted in BusinessWeek:

"He may be the most prolific author in history," says Amazon's Kurt Beidler.

Later I was introduced as such in a public seminar by the moderator. To none of the
persons above, nor to my friends or colleagues, have I positioned myself in these terms
(in fact, most of my colleagues at INSEAD, except our librarian and others in my
department, either had no knowledge of, or were surprised, that I was working on this
topic). On a few occasions I have jokingly mentioned the quotations.

This claim, however, is a bit off the mark, given the fact that I am working on
reference and educational titles, many of which are in report form. Amazon lists these
titles in their “book” section, but they have nothing in common with novels (considered
by many to be the only legitimate form of “authored” work). The book industry, of
course, has many types of books. The ones I work on are of very particular genres. One
journalist (Asher Moses, from Sydney) uses wording I would prefer:

But Parker doesn't describe himself as an "author", and he's far from the creative
type. Rather, the US-based professor of management science at INSEAD business
school has developed and patented algorithms enabling computers to write books for
him.

My patent uses the term “title material”. For this I am certainly not the most
prolific (e.g. compared to the U.S. Government).



Myth #4 - I do not announce that the books are generated by computer.


We need to distinguish across genres. One article is correct in saying that nothing
announces that the healthcare titles are computer generated. I do not announce this
since the text is not written by a computer. All of the text is written by
professionals. The computer does not scour the internet, automatically summarizing the
results into a book. Some internet sources are cited, given that this is an internet
guide series. Only the formatting, indexing, lists, front matter, etc. are done
automatically (this copy editing saves a substantial amount of time). The health series
happens to be about “how to use the internet” so the confusion is understandable. It
might be misleading to mention that computers were used, when applied to the formatting
(e.g. “the table of contents of this book is computer generated by indexing it to
Heading1 settings in a Word document …”). When a newspaper uses software to format
columns, it does not announce this to the reader.

Interestingly, the one series I have worked on where humans wrote 99% of the text is
the one where some readers gave some titles a low rating on Amazon, or were willing to
believe that a computer wrote the text (in contrast, some patient associations
recommend and/or send the guides to their members). Of course, it is normal that some
titles have high and some have low ratings when one publishes hundreds of titles in a
subject area. All the health titles and medical text were vetted by professionals (e.g.
medical librarians, forums, etc.). The series was created when medical libraries
started “internet training services” and were seeking guides that were disease
specific. In that series I am listed as an editor, not author. All contributed text is
cited, and all organizations were contacted to obtain permissions for quotations. All
passages were hand cleaned and edited by human editors. In the genres where 99+% is
written by computers, there are few if any negative comments (i.e. computer authored
titles score a higher average rating than human written and edited books in this simple
case - not enough evidence for an academic test, however, to make any strong
conclusions); the business, crossword, and classics titles are thinly sold via Amazon,
but sell through more traditional business and/or direct to library channels, and are
greatly appreciated.


In the genres where computer algorithms create original content or results (which is
what my patent covers), this is described in the methodology of each book by using
language like “an econometric model is applied, …” I do not say that a computer was
used to do the econometrics as this would appear silly to the reader (rarely, if ever,
do people calculate the sum of squared errors by hand). The sections that are neither
econometric nor formulaic are written by me (by hand). The reader expects a computer to
be used for the formulaic aspects of these titles (e.g. in the trade reports where 90%
of the titles are tables and charts).


Finally, in the crosswords and classics, I do not announce that computers are used
because I never really thought about it - does it really matter? The computer is nested
with a very specific editorial and/or linguistic logic that makes the titles of use to
non-English speakers (graph theory is used in these genres - a potentially distracting
topic). I find this issue to be an interesting question. We watch movies, and yet no one
pre-announces to the viewer that productivity enhancing tools like spell checkers, or
Adobe Premiere, Avid, or Maya are used.



Myth #5 - My authoring programs dynamically scour the Internet and compile the
results in a book (as auto-bloggers do).


Neither my YouTube video nor the New York Times article (or others) mentions this
explicitly (the Times article is suggestive). Bloggers extrapolate this conclusion from
words like “data bases” and/or “public information sources.” Well before the Internet,
people wrote crosswords, performed economic analyses, and wrote poetry from “public
information sources” (e.g. time series, and word lists/dictionaries). My computers do so
“off line” by mimicking, for example, economists or poets. There is simply no need to
use the Internet. None of my applications dynamically grab things off the internet based
on a Google search and throw them into a book. For example, for the econometric
studies, INSEAD purchases large quantities of source data not available to the public
over the internet, which have existed since the 1980s and were distributed via data
tapes, or now, via DVD (I’ve mentioned to journalists trade organizations, the IMF,
etc., as economists do in practice). All applications are database driven (some store
links to the internet, if the subject of the book is the Internet itself; these were not
scraped from Google or other search engines). I have also amassed over the last 25 years
a very large multi-lingual lexicon, among others, some of which I have posted for public
use (www.websters-online-dictionary.org - again, the pages, and editing, of which are
generated via automation programs). Here are reviews of my dictionary, which was started
in the early 1990s, and launched off-line in 1999:

http://www.websters-online-dictionary.org/credits/editor.html



I think the Internet scraping approach, however, is fruitful in terms of generating new
knowledge and knowledge structures. In today’s age, people assume that if data are
involved, there can only be an Internet approach. My automation project began well
before browsers were invented, using data available before the Internet age. The
Internet, however, does allow genres that could not have existed otherwise (e.g. guides
to the Internet).



Myth #6 - the programs simply copy and paste pre-existing information


The programs can do this, and will do so for some genres for some limited sentences or
sections (such as boilerplate), but the vast majority (i.e. the 200,000 titles) do not
do so (the output is “calculated on the fly”) for the bulk of the pages; i.e. titles
cannot violate a copyright as the output is wholly original, and mimics my thought
process as an economist or linguist (i.e. I use a very fast pen). This results in titles
whose contents do not pre-exist in any databases, nor can be found on the Internet. I am
working on series that can combine pre-existing data with original content (as some
genres are designed to be this way).

What are the “algorithms”? They are generally mathematical approaches that I have found
well suited to specific genres (or sub-genres). Here is a summary of the methods used
(please refer to Wikipedia.org for definitions of unfamiliar terms), posted with my
YouTube video:


The "algorithms" depend on the genre. The most advanced use parametric,
non-parametric as well as Bayesian econometrics, graph theory, and meta
analysis (mostly coupled with some specialized computational linguistics and
editorial rules that are required within certain genres) -- each piece is rather
straightforward; the combination allows complexity. In terms of IT or
programming languages, there is no rigidity to this - again it depends on the
genre. If animation is the goal, then code is written to write MEL scripts, etc.,
which can automate Maya, which can in turn automate rendering, lights, etc.,
via macros. This works well, but for only certain aspects of that genre.
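
The idea of "code that writes MEL scripts" can be pictured with a minimal sketch in
Python. This is only an illustration under stated assumptions, not the actual
application: the function name `mel_bounce_script`, the object name `titleCube`, and
the triangle-wave motion are all hypothetical, and the generated text uses only the
standard MEL commands `polyCube` and `setKeyframe`.

```python
def mel_bounce_script(name: str, frames: int, height: float) -> str:
    """Return MEL source text that creates a cube and keyframes a simple hop.

    Hypothetical helper for illustration only; the generated script would
    still have to be run inside Maya to have any effect.
    """
    lines = [f'polyCube -name "{name}";']
    half = frames / 2
    for frame in range(1, frames + 1):
        # Triangle wave: the cube rises until the middle frame, then falls.
        y = height * (1 - abs(frame - half) / half)
        lines.append(
            f'setKeyframe -time {frame} -attribute "translateY" '
            f'-value {y:.3f} "{name}";'
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the generated MEL so it can be pasted into Maya's script editor.
    print(mel_bounce_script("titleCube", frames=4, height=2.0))
```

The point of the design is that the Python side holds the genre-specific rules, while
the platform (here Maya) only receives plain script text to execute.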


Some titles are 98 to 100 percent computer automated (e.g. business titles,
crosswords, etc.). For health titles, only the format editing and production side
is automated. The text in the health books was written by medical
professionals and edited by a professional editor; the computer expedited
formatting using about 50 odd routines (the preface, chapter intros, glossaries,
indexes, headings, margins, etc.); highlights are made to sources generally not
known to internet-averse readers or medical practitioners (designed for
medical libraries with internet training services).


Currently, some 2 percent of the titles rely on government sources for text.
None perform a Google search, spider the net, etc. Some 98 percent of the
titles are wholly generated via automation programs; the applications create
original information or content that cannot be found elsewhere (e.g. maximum
likelihood trade estimates, latent demand forecasts via a decision calculus
approach, Chinese and English crosswords, etc.) - offline applications with no
interaction with the internet. In total, there are about 17 genres created this way
(about 200,000 titles or so since 2000).


It can take several years to set up an application (including all human inputs,
licensed sound effects, textures, models, mocap, data, or decision rules that go
into any genre-specific application). Platforms (e.g. Maya) pre-exist. The
incremental, or marginal, creation time per title is mentioned in the video.


The genres are blind or peer reviewed and/or vetted by users (e.g. librarians or
end-users) before they are put into print. The games are played by kids to see
what they like. For 3D games, a pre-existing rendering engine is like a blank
Word document. The rendering engine is not created from scratch, but licensed
(like MS Word).


I am mostly now working on education titles for Asian, African, and Native
American languages that do not have educational materials (games,
supplements, texts, videos, mobile phone books, etc.) written in or augmented
by their languages. See my dictionary at:

http://www.websters-online-dictionary.org


A very small percent of the linguistic material used is posted. Watch for a
major update and linguistic augmentation to the dictionary this summer when I
will also be introducing EVE. She is an "economically viable entity". A step
beyond a chat bot, using some of the algorithms mentioned above (with a bit
of utility theory and optimal control theory thrown in).


There is no "commercial" or "public" or "open source" software that can be
used by the general public. Some applications are terabytes large. I am
working on a relatively small poetry application for public use -- to be
released when completed (probably in a year), which will do several forms of
poetry, on any topic the user desires; and allow the user to request "another" if
they do not like the first one written, or "change that line", etc. The following
are samples of grammatical acrostics, practiced in elementary schools to
introduce children to poetry (title is an acronym for words in the poem):


NUDE

Naked unclad, dear enactment.


LOVE

Lean of vile emotions.


GOD

Gentlemen of divinity!


BOOK

Bible ordered, obtained Koran.


The application for this genre uses graph theory (clique commonality) and
over 40,000 grammatical structures, ranked by meta-analytic probabilities of
being understood by English readers.
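
To make the notion of a grammatical acrostic concrete, here is a toy sketch in Python.
It is deliberately not the graph-theoretic method described above: the tiny `LEXICON`,
the two part-of-speech `TEMPLATES`, and the function `acrostic` are all hypothetical
stand-ins, and the sketch simply picks the first word with the right initial letter for
each slot of a fixed grammatical structure.

```python
# Hypothetical mini-lexicon, grouped by part of speech (illustration only).
LEXICON = {
    "adj": ["lean", "odd", "vast", "eager", "bold", "open", "kind"],
    "noun": ["love", "ocean", "voice", "echo", "book", "order", "king"],
    "verb": ["lifts", "opens", "visits", "ends", "binds", "offers", "keeps"],
}

# Hypothetical grammatical structures; a real system would rank many thousands.
TEMPLATES = [
    ("adj", "noun", "verb", "noun"),
    ("noun", "verb", "adj", "noun"),
]


def acrostic(title):
    """Return a one-line acrostic whose word initials spell `title`, or None."""
    letters = title.lower()
    for template in TEMPLATES:
        if len(template) != len(letters):
            continue
        words = []
        for letter, pos in zip(letters, template):
            # Take the first lexicon word of this part of speech with the letter.
            match = next((w for w in LEXICON[pos] if w.startswith(letter)), None)
            if match is None:
                break  # this structure cannot be filled; try the next one
            words.append(match)
        else:
            return " ".join(words).capitalize() + "."
    return None


if __name__ == "__main__":
    for title in ("LOVE", "BOOK"):
        print(title, "->", acrostic(title))
```

A fuller version would score candidate structures and word choices rather than taking
the first match, which is where ranking by probability of being understood would enter.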

There are many other areas I am working on, as there are multiple avenues to
explore, especially in the areas of new media (mobile and fixed), but more so in
high-end analytics and knowledge discovery (i.e. generating knowledge that could
not be created otherwise) as applied to business, language and public services
(e.g. criminology) - where unmanageable, sparse, disintegrated or larger data
sets (off-line) result in new knowledge structures usable by decision makers (e.g.
connecting the dots where humans have difficulty doing so, for lack of time or
expertise).



Myth #7 - it costs 12 cents to create a book


This figure (or similar numbers) reflects the marginal cost. The full or average
cost is much higher. The set-up cost for an application can be hundreds of
thousands of dollars - costs that may not be recovered over the life of the
genre. This is true for both electronic and non-electronic versions.
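
The gap between marginal and average cost can be shown with a short worked example in
Python. The set-up figure of 300,000 dollars is an assumed number chosen only to fall
within "hundreds of thousands of dollars"; the 12 cents is the marginal figure quoted
in the myth, and `average_cost_per_title` is a hypothetical helper.

```python
def average_cost_per_title(setup_cost, marginal_cost, titles):
    """Average cost = set-up cost spread over all titles + marginal cost."""
    if titles <= 0:
        raise ValueError("titles must be a positive count")
    return setup_cost / titles + marginal_cost


if __name__ == "__main__":
    SETUP = 300_000.0   # assumed set-up cost for one genre application
    MARGINAL = 0.12     # marginal cost per title, as quoted in the press
    for n in (100, 10_000, 1_000_000):
        print(n, round(average_cost_per_title(SETUP, MARGINAL, n), 2))
```

Under these assumptions the average cost approaches the 12-cent marginal cost only
when a genre runs to a very large number of titles.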


Concluding Remarks


The most interesting aspect of this project, to me, is what can be achieved by it. To
date, journalists have not covered this angle. I think what I have done thus far is
extremely modest, and many other applications can be developed, especially in genres
that involve highly repetitive writing methodologies, or that lack the economies to be
created otherwise (languages or topics that are obscure to most, but critical to
others). Here are a few comments from around the Internet by people who see this
potential:




In a way, humanity can be defined by what it is that humans can do that
machines can’t do. That boundary is continually being pushed further, and in
coming years we will need to move to increasingly complex and imaginative
tasks of synthesis and creativity that computers cannot do. Philip Parker, a
professor at INSEAD, is probably doing more than anyone else to push this
boundary. In many cases the market is too small to justify a person writing
the report. However there is no question that a significant part of an analyst’s
work can be automated. The boundaries of human value are being pushed
further, and this is just the beginning.
Ross Dawson



As [his] video demonstrates, many of his works are economic or market
analyses and forecasts, but he also uses the technology to write about obscure
medical topics - both genres that he’s able to succeed in because they are
underserved by traditional authors.
Scott D. Anthony



It's a fascinating subject and it calls into question many of our assumptions
about writing and research. This guy is part of a movement that is doing to
office workers what the industrial revolution did to blacksmiths.
Daryl


To be fair, here is the other side of the coin:




Philip Parker has won today’s “Worst Person in Publishing” award. I wanted
to give him the “Worst Person in the World” title, but, well, I’m fairly certain
that’s been copyrighted. Hmm, maybe the New York Times will share the
honors, if only due to its continued lack of critical thinking when it comes to
covering books and publishing. … Likewise, I am not sure that the NYT,
as close an industry-town publication as possible, is capable of writing about
the publishing business with clear-eyed intelligence.
Kassia Krozser



Fire the monkeys! Return them to their happy habitats! Our genre of choice
will be written by GLaDOS, and other AI computers, because there’s only “so
many body parts” about which to write a romance.
SB Sarah



Actually Parker is providing a rather useful service for those who understand
the limits of his “books.” I just hope that the “Make Money Fast” crowd
doesn’t catch on too quickly to the possibilities here and come up with yet
another product category to push through e-mail and blog comment spams. As
for Amazon, I wouldn’t mind a filter to separate Parker-style books from the
purely human-done variety. Meanwhile perhaps Parker and his machine-aided
crew can go on to write a coping guide for victims of technology.
David Rothman


Mr. Parker is an “author” only in the loosest sense.
Jane



Should authors be worried? Probably not, at least not yet. There's a wide gap
between what a computer can compile and the nuanced hand of a skilled artist.
Still, this news is a bit unsettling to those employed in the creative arts. And,
taking the music industry as an example, it doesn't seem well advised to
underestimate this sort of development. It's the kind of trend that could as
easily become a dead end as an overnight sensation. Either way, it's worth
consideration.
Nathan Denny



He also says, "'My goal isn’t to have the computer write sentences, but to do
the repetitive tasks that are too costly to do otherwise.'" That has me really
baffled. Aren't romances composed of many, many sentences? Fortunately I,
having endured this sort of ignorant notion of romance novels for twenty
years, have learned to calm down and carry on relatively quickly.
Margaret Moore



His ignorance [about romance novels] is almost embarrassing.
Kimber Chin



The London Times has pointed out one Philip M Parker who has created over
200,000 titles (albeit mostly statistic books from what I can see) using print
on demand technology. But the worst part is that, by his own admission,
automation produced a large part of his works. And he’s planning to move into
romance novels and poetry. That’s what freaks me out. No matter how
formulaic either genre can be, in the most juvenile hands, it is still something
human. The idea of automated poetry makes my skin crawl.
bookology.wordpress.com



… it’s now possible to foresee a literary future in which human intervention is
no longer required.
Michael Moran



The best publishers are focusing on building large growing communities.
Content is becoming a commodity, as content without subscribers is worthless.
As failing mainstream publishers follow in Mr. Parker's footsteps, small
publishers stand no chance to compete unless they have an army of brand fans.
Aaron Wall



I guess the automated content may look good enough to look real, but the
talent is something more than that. I think such automated tools are a threat to
everyone who publishes mediocre content though.
bobby_handzhiev



Won't the advent of programmes like this enable more small publishers to
produce content? I think this will drive the premium on quality original
content higher still. However, long term (maybe 20 years +) perhaps AI will
have reached the point where it can start drawing its own conclusions. Then
we really become redundant! And who will be leading the way with AI?
Perhaps the company collecting huge amounts of data of every aspect of our
lives? Google.
BenCo



if you are ever stuck for an absorbing read don’t forget "The 2007-2012
World Outlook For Bridges, Crowns, Dentures and Other Orthodontic
Appliances That Are Customised For Individual Application on a Prescription
Basis"
Roland Dodds




… we hope someone sent from the future destroys these robot authors -
partly because we don’t want to be destroyed by the machines, and partly
because we are pretty well out of robot-war jokes. But we'll do what we have
to if more news comes along - because, while we may run out of punch
lines… [adopts growly, inspiring Bill Pullman voice]… we'll never run out of
hope.
Ben Mathis-Lilley


What is my take? I think that the most useful applications will be created for genres
that are so complex or labor intensive that automation is almost the only viable
approach. That being said, writing hundreds of original high-quality Ph.D. theses will
be easier to accomplish using this approach than writing a single creative and
high-quality children’s story (given the lack of formulaic sub-genres that can be
reverse engineered). “Human creativity” in this sense is the absence of formulaic
authorship techniques that can be reverse engineered. Some Ph.D. theses, and forms of
poetry for that matter, are not that “creative”. Creative authors, journalists, editors,
report writers, manual writers, script writers, or bloggers, therefore, need not fear
ever being replaced by this process. The same is true for creative doctoral students,
moviemakers, television producers or PC game makers.


Then what does original mean? From a pragmatic point of view, if one title borrows
from another to a sufficiently large degree (especially without citation), it might be
considered un-original, if not plagiaristic. If the two titles have so little in common
that they do not seem to borrow from each other, one might say they are originals
(e.g. a romance novel - not all - can have a formulaic plot, but use different sentences
and paragraphs that do not overlap to any noticeable degree with an existing romance
novel with exactly the same plot). This form of originality (or lack thereof) is often
seen in television game shows. Each episode is original, but each episode uses the
same segment sequences. Original and very entertaining, but not that creative from
one episode to the next. In essence, viewers crave the formula and want to see it
repeated in original episodes. The genre in its entirety, of course, can be a very
creative result.



What is quality? It lies in the eye of the “segment” (in publishing industry jargon). A
trade study can be far more useful than a romance novel to someone wanting to
prioritize world markets for the products they are selling. The opposite is true for
someone who loves novels and is not involved in international trade. There are
segments to content markets. Can a computer, therefore, write prose that is higher
quality than Shakespeare? Of course; especially if the person comparing passages
side-by-side hates Shakespeare or does not understand Elizabethan English (probably
a large enough segment). Will a computer generate work reaching the stature of
Shakespeare in English courses? I doubt it (unless, of course, the formulas used by
the Master can one day be reverse engineered; or a great author of that league, as yet
unknown, is also a great programmer).


Will this make human authorship obsolete? For some forms, potentially "yes", for at
least the formulaic or mundane forms of human authorship, or for human authorship
of genres that are uneconomical otherwise. Which genres of authorship (in video, text
or other formats) are not formulaic enough to be automated? Time will tell.


I hope this clarifies & thanks for reading.

Phil





More Background




Overview

Some like calling it a “book writing machine” or “software” but in fact it is a
computer-based automation process for authoring, irrespective of the format (book,
video, PC games, etc.), language, or subject (fiction or non-fiction). For those
interested in the technical aspects of the process, please refer to the actual patent
which presents flow diagrams, etc., and to a YouTube video that tersely describes the
process and shows an example of an application and some output:


Patent: http://www.google.com/patents?id=bHeBAAAAEBAJ&dq=philip+m+parker

YouTube: http://youtube.com/watch?v=SkS5PkHQphY



It is strongly recommended that interested persons read the full patent. On the patent
page, the reader will find detailed technical descriptions of the process and the prior
art. Professor Parker began working on this project in the early 1990s. The goal was
to create original titles (books, videos, games, etc.) on topics that would not be
economically viable if published using traditional methods, or covering topics that
might be of interest to a limited audience that would nevertheless find the titles useful
(what some call the “long tail”). The process does not require “Internet scraping”, and
most existing implementations of the process are Internet independent. The patent is
written as a “pioneer patent” as it applies to all forms of original title materials
(videos, books, PC games, etc.) created in this fashion.


Forms of Authorship

Much as authors publish various forms of fiction and non-fiction literature, it is
convenient to see the method or process as allowing various forms of authorship
automation (which can be used in combination).


Form 1: Involves compiling existing information, sorting and formatting it, and
drawing basic conclusions (e.g. if there is no pre-existing content, then this fact alone
may lead to original logical conclusions drawn about the topic). This level is
useful for consolidating and structuring knowledge in a domain where much
of the text, video or sounds pre-exist. The programming for this approach
typically involves hundreds of details, especially with respect to formatting
and style. Typically in the form of a compilation, some of the output
components will be original, and can result in new knowledge.


Form 2: Involves replicating a formula within a genre. In this case, new
knowledge is not necessarily generated, though the reader or viewer may end
up acquiring new knowledge. In this case, the data (words) may be in the
public domain on a stand-alone basis, but the output is as original as what a
human author (or director, screenwriter or actor in the case of a movie) might
create. The final result is typically wholly original.


Form 3: Involves the generation of new knowledge as the primary goal. This
involves, for example, the computer mimicking a specialist who is asked to
prepare a report, film or game that draws original conclusions, images or
levels of entertainment. For example, if one asks an economist for an opinion,
the economist will typically perform an analysis and make summary
statements based on his or her findings. The automation process, in this case,
literally follows the behaviours of the economist, and reports the findings --
findings that have never appeared before in any format, and that neither pre-exist in
any database nor are currently available on the Internet. The computer, in this
case, is pre-invested with knowledge or expertise (e.g. economic models and
knowledge of economic geography). For this approach, the word “specialist”
is domain independent. We can rewrite the example from above to be:

“For example, when one asks a poet for a poem on a given subject, they will
typically ponder on the subject and write prose based on their inspirations.
The process, in this case, literally follows the behaviours of a poet, and
creates a poetry book - consisting of poems that have never appeared before
in any format.”


The distinction between poetry and econometrics is the formulaic natures of
the genres, but not the process to author them. The third level can create
high-end econometrics to the same degree that it can write poetry. It turns out that
the most useful applications at Level 3 are for genres that are so complex or
labour intensive that automation is almost the only viable approach.


History


Research on this approach began in the 1980s and early 1990s. The first titles authored
via full automation relied on Form 3 (described above), with the goal of generating
new knowledge that would be difficult to obtain otherwise. These came in the form of
e-books distributed via high-end distributors dedicated to this market (Dialog,
MarketResearch.com, etc.) and then print-on-demand titles (Ingram’s LSI and Amazon’s
Booksurge). The “Trade Perspective” series was created due to the inconsistencies of
import data from importers, and export data from exporters. The model comes up with
maximum-likelihood estimates of real trade flows (adjusting for currency
fluctuations) - a rather boring process but of interest to people involved in
international trade. This series is mostly used by government agencies and businesses.
Similar series using Form 3 are “World Outlook Reports” that produce Bayesian
econometric estimates for the worldwide latent demand for various products and
services, and the “financial and labour benchmark” series which mimic the process used
by accounting firms and/or investment banks to compare real differences in economic
performance across firms and/or economies with differing accounting rules. For each of
these series, there is a very large upfront cost to creating the series (many man-years
of programming in most cases), but once this is accomplished, the incremental cost per
title is very low (the costs mentioned by journalists are the incremental cost of about
50 cents, not the total or average cost per title, which is much higher when
considering start-up costs). Samples of these books can be found at
http://www.icongrouponline.com/browse/.


Later, series using a combination of Form 1 and Form 2 were created in the form of patient
and physician sourcebooks. Around 2001, medical libraries launched efforts on “internet
training” for their patrons (e.g. how to use the internet to research diseases). This series
was created for this market and is mostly distributed via OCLC’s NetLibrary service in
e-book format. Form 1 was also used to create a series of bilingual classic titles which
provide a running thesaurus in the language of the reader.


More recently, multilingual crossword puzzle books and thesauri were created using Form 2.
Some of the thesauri rely on a graph theoretic approach (combined with traditional
computational linguistics) to derive what is probably the world’s largest multilingual
thesaurus.


A small percentage of the databases required for some of these later genres is posted on
Webster’s Online Dictionary (www.websters-online-dictionary.org), which was started in 1999
as a testing ground for the general approach (i.e. the automatic authoring of original
content on a web site):


Some Background & Reviews:
http://www.websters-online-dictionary.org/credits/editor.html

Another Review:
http://hurricanecountry.blogspot.com/2006/12/dictionary-heaven.html

The Objective:
http://www.websters-online-dictionary.org/about.us/about.html

The Site:
www.websters-online-dictionary.org



As only 10% of the data available are posted, future editions will be substantially
larger and allow for high levels of interactivity.


Recent History

In terms of R&D, substantial time and effort are currently being invested to create (1) a
series of interactive web sites that can automatically author titles, (2) educational game
shows and (3) language learning programs. With respect to video, instead of automating
“Word” to author a book, the same process is being used to automate Maya and video editing
software (software for 3D animation/video used in movies like King Kong, The Matrix, and
Shrek). The goal is to create video programming to teach any concept, and in any local
language. It turns out that for most of the world’s languages (e.g. Estonian, Maltese,
etc.), the cost of video programming using traditional methods is prohibitive, so local
stations end up dubbing foreign-based programs (or programs receiving government subsidies).
This project started in 2004 with 3D games and software (a bit easier to begin with than
video), which has resulted in hundreds of titles distributed by Digital River, Handango and
Microsoft (for Pocket PC versions), among others. The following is a YouTube link to cut
scenes from a game show designed for language learning, a formulaic form of television (that
is being coded for automation):


http://www.youtube.com/watch?v=Fug4UGbsIxY



The following is a video “word of the day”, that will be used across many languages:



http://www.youtube.com/watch?v=slNTZ4vEqGQ


Here is a cut scene from the 3D game:


http://www.youtube.com/watch?v=2QBC5zlXdDw




FAQ


This FAQ covers other common questions. For each question, the generic answer is typically
“it depends on the genre” and “it depends on the format (book, video, software, PC game,
etc.).”


Q: Can I have a copy of the software?

A: No. The process is not a software package, but a complete system that requires that a
computer or computer network be set up for this purpose for a particular genre. Most genres
are too large to be easily transferable via the internet. One video application is many
terabytes, and other applications are many gigabytes.


Q: How long does it take to set up a genre?

A: This completely depends on the complexity of the genre and the quality one is willing to
accept for the titles. The earliest genres took several man-years to create before they met
industry standards (i.e. to the quality of a human author). The later genres took a matter
of months (e.g. crossword puzzle books). Sometimes the longest part is acquiring and coding
domain knowledge (e.g. knowing how a Ph.D. thinks in a particular domain before they author
a genre). Already published genres rely on advanced graph theory and econometrics; others
rely on traditional content analysis.


Q: How much does it cost to produce a book or other title?

A: It depends on how you define cost. The marginal cost of creating a title in electronic
format is the price of the electricity used to create the title, and some small amount of
hardware depreciation (maybe around 50 US cents). The average cost, which includes the
printing of the book (in paperback), or a game in DVD or CD format (printed on demand), and
the overhead to distribute the book, can range from around $10 to around $30. The total cost
for an entire genre of books, videos, or software games can exceed hundreds of thousands of
dollars in programming time, database acquisition or licensing, and other overheads. Once a
large sum of sunk costs is expended, the marginal costs are minimal. For video or high-end
gaming, the costs can be very high; with the budget to create a single traditional 3D
animated movie, however, one can use this approach to create thousands of titles within a
given video genre.
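
The relationship between sunk cost, marginal cost, and cost per title described in this
answer can be made concrete with a little arithmetic. The sketch below is illustrative only:
the function name and the sample figures (a $300,000 set-up cost and a 50-cent marginal
cost) are assumptions chosen to match the orders of magnitude mentioned above, not actual
accounting data, and printing and distribution overheads are left out.

```python
def amortized_cost_per_title(setup_cost: float, marginal_cost: float, titles: int) -> float:
    """Set-up (sunk) cost spread evenly over all titles, plus the marginal cost of one title."""
    if titles <= 0:
        raise ValueError("titles must be a positive integer")
    return setup_cost / titles + marginal_cost


# Hypothetical figures, for illustration only.
SETUP_COST = 300_000.0  # assumed programming, data licensing and other overheads for one genre
MARGINAL_COST = 0.50    # assumed electricity and hardware depreciation per electronic title

for n in (1_000, 10_000, 100_000):
    cost = amortized_cost_per_title(SETUP_COST, MARGINAL_COST, n)
    print(f"{n:>7} titles: {cost:.2f} USD per title")
```

As the number of titles in a genre grows, the per-title figure approaches the marginal cost,
which is why the 50-cent figure and the much higher start-up costs can both be accurate.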


Q: Is this really that complicated?

A: It depends on the genre and format. During early genres it was found that rather
complicated issues were simple to implement (e.g. Bayesian econometrics), and logically
simple things were nearly impossible to implement (e.g. getting Windows to behave well when
indenting certain graphics, or rendering in DirectX). In general, Joseph Weizenbaum says it
all:


'It is said that to explain is to explain away. This maxim is nowhere so well fulfilled as
in the area of computer programming, especially in what is called heuristic programming and
artificial intelligence. For in those realms machines are made to behave in wondrous ways,
often sufficient to dazzle even the most experienced observer. But once a particular program
is unmasked, once its inner workings are explained in language sufficiently plain to induce
understanding, its magic crumbles away; it stands revealed as a mere collection of
procedures, each quite comprehensible. The observer says to himself, "I could have written
that." With that thought he moves the program in question from the shelf marked
"intelligent" to that reserved for curios, fit to be discussed only with people less
enlightened than he.'



CASE STUDIES



The following case studies illustrate a few examples of how the technology has been used to
create large quantities of original title materials. These are presented for illustrative
purposes only, and reflect a small part of potential applications.


Reference, Research & Educational Books

Output: Over 250,000 original titles, available in various paperback and ebook formats
(www.icongrouponline.com).

Distributors: Barnes & Noble®; amazon.com; Lightning Source (Ingram Book Group); NetLibrary
[OCLC - eContent]; Ingram Digital and MyiLibrary; ebooks.com; google.com, among others.


Beyond the tasks accomplished by acquisition editors and publishers, books are traditionally
written by human authors, edited by humans, and formatted by human production editors. These
are in turn marketed by humans. Using the most advanced approaches to electronic publishing,
this approach reduced the time to create and publish reference and educational books. The
approach is of interest to the publishing industry, which is becoming more fragmented and
specialized as print-on-demand and ebook technologies are showing substantial growth.
Coupled with electronic distribution via libraries, publishers and media companies can now
access what may have previously been seen as saturated markets. Examples of genres produced
for ICON Group include:





Patient Sourcebooks (500 titles by disease or condition)
Physician Dictionaries (2100 titles by disease or condition)
Genome Sourcebooks (190 titles by disease or condition)
Bilingual Crossword puzzles (1200 titles, 100 pages each)
Classics enhanced via computer authoring for test preparation (150 titles)
Classics enhanced for non-English mother tongue speakers (1000s of titles)





Scientific Discovery, Research, Custom Publishing and Proposal Writing




Output: Over 150,000 Industry and Business Intelligence Reports.

Distributors: marketresearch.com; www.bharatbook.com; manta.com (ECNext); MindBranch, and
EBSCO, among others.




In terms of discovery, intelligence analysts, researchers, scientists, security specialists,
or anyone who must "connect the dots" may not have the time or capacity to exploit their
skills to their maximum potential. The databases and/or sources of information used to
generate and quickly communicate new knowledge may be so vast or complex that traditional
approaches simply fail to exploit the potential. Similarly, in business, a substantial
amount of valuable management time can be wasted writing proposals, or proposals are never
written, resulting in opportunity losses.



This approach has been used to create, for example, approximately 14,000 international trade
studies that draw original conclusions with respect to the world's trade flows across
numerous product categories. The meta data and related information required for distribution
for each title were also authored via automation. Examples of these titles can be seen, for
example, at marketresearch.com, one of the largest distributors of high-end market
intelligence. Had this genre been approached using traditional methods, the cost of
producing each title would have been prohibitive. This approach can also be used to localize
educational content for specific markets, down to an individual instructor or student.




Networked Multiplayer Games/Simulations

Output: A virtually infinite number of business simulations for INSEAD (Singapore and
Fontainebleau, France), INTERCOMP Simulation (www.insead.edu).

MBA programs and executive education programs around the world have, for years, relied on
business simulations to teach strategy, operations, and marketing. These simulations, or
games, are played by teams or individuals who compete against each other while learning and
applying business frameworks.


Traditionally, business simulations have been industry (e.g. consumer electronics),
geography (e.g. a fictitious world) and/or language specific (e.g. English). INTERCOMP is
not a simulation, but rather a simulation "writer." It was created using an approach that
allows a virtually infinite number of simulations on any known industry (e.g. from
toothpaste to industrial power transformers), any realistic geography (within a specific
country, like China and its various cities, or across a selection of countries and cities
relying on real economic data), and language (English, French, Chinese, Arabic, or any of
200 or more other languages). The simulations can be further tailored to specific business
topics or emphasis (e.g. HR, finance, production, marketing, strategy, etc.). An example of
one such simulation is dedicated to the mobile communications handset industry that pits
Apple, Nokia, HP, Dell, Motorola, HPC, Samsung, LG, and Sony-Ericsson against each other in
a global battle to conquer the world market across 57 countries. The setting is five years
into the future when a new generation of mobile communications standards has been adopted by
operators and manufacturers. This simulation has been used in an award-winning MBA elective
and executive education course; a version dedicated to telecommunications is available for
download at:


http://webfac.insead.edu/intercomp/downloads/program_latest_version.html



The advantage of this process is that simulations and/or multiplayer games can be created at
minimal cost for a specific group, or “clique”, of executives or individuals in a specific
industry, simulating real competition faced in that industry. Because the simulation can be
calibrated using real data, the output is not merely a simulation, but a strategic planning
tool that can be used to foresee competitive activities or simulate game theoretic outcomes.
After setup, no clique is too small for a fully customized simulation or game, given that
the marginal cost of producing a game for the clique is virtually zero.
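
One way to picture a simulation "writer" is as a program that fills a common template from
independent choices of industry, geography, language, and emphasis. The following is a
minimal sketch of that idea and not a description of how INTERCOMP is built: the class name,
its fields, and the option lists are hypothetical, and the lists are far shorter than the
ranges described above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SimulationSpec:
    """Hypothetical description of one generated simulation."""
    industry: str
    geography: str
    language: str
    emphasis: str


# Illustrative option lists only.
INDUSTRIES = ["toothpaste", "mobile handsets", "industrial power transformers"]
GEOGRAPHIES = ["China", "France", "Singapore"]
LANGUAGES = ["English", "French", "Chinese", "Arabic"]
EMPHASES = ["HR", "finance", "production", "marketing", "strategy"]


def enumerate_specs():
    """Yield one spec per combination of the options above."""
    for combo in product(INDUSTRIES, GEOGRAPHIES, LANGUAGES, EMPHASES):
        yield SimulationSpec(*combo)


specs = list(enumerate_specs())
print(len(specs))  # 3 * 3 * 4 * 5 = 180 combinations
print(specs[0])
```

Even these four short lists yield 180 distinct specifications, so adding options along any
one dimension multiplies the number of simulations that can be requested.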







PC Software and Video Games

Output: 400 Educational Game Titles and over 1200 Reference Software Titles.

Distributor(s): www.digitalriver.com



There are role-playing games, adventure games, first person shooters, strategy games, sports
games, educational games and a variety of others. Each of these follows a generally accepted
set of rules which users have come to expect. Each title can be in 2D or 3D formats designed
for a variety of platforms (PC, console, mobile devices); each format is further bounded by
formulaic requirements. Traditionally, dedicated teams create a single title within a genre,
each with a substantial cost.


I approach game development by automating "game writing" programs which author original
titles, covering the entire genre selected. A recent example of this was a series of some
2000 third-person shooter PC games that allow children, ages 4 to 6, to learn basic English
as a second language (or other topics). A tomato, called "Webster", defeats an enemy called
IGNORANCE, who has armies of evil avatars (e.g. from dinosaurs to space ships). Within each
topic covered by this sub-genre, there are 4 separate game titles featuring differing
graphics, sound effects, challenges/puzzles and enemies. A video cut scene illustrating this
game series can be seen here. Some game play can be seen here (towards minute 8). Each game
title takes approximately 5 to 10 minutes to create, irrespective of the topic.

Here is a low-resolution screen capture of an extended video of a game created this way. 2D
multilingual games are listed here.






Mobile Phone Applications (Pocket PC & Smartphone)

Output: Thousands of Pocket PC dictionaries and games for Handango (www.handango.com),
Microsoft.com, and others.

Recent research indicates that people in many low-income countries often first experience
the Internet via a mobile communications device. In high-income countries, Smartphones,
Pocket PCs (PDAs), multimedia phones and video players are gaining greater acceptance as
users upgrade from traditional devices, and operators push higher-end handsets which
increase network traffic. Greater on-board memory and higher download speeds are also
creating greater demand for mobile content tailored to a large number of localities with
differing content needs.




Traditionally, mobile content publishers create a game or application and, once it is
successful, localize the title for large markets or create sequels for the one market where
the title was successful. The technology allows original titles to cover the entire spectrum
of topics/geographies within their respective genres, with each title authored in a matter
of minutes. Automation also allows for cross-platform authoring, given the variety of
operating systems (RIM, Symbian, Microsoft Mobile, etc.) and devices (iPod/iPhone, Nokia,
Motorola, Sony-Ericsson, Samsung, HTC, Blackberry/RIM, LG, etc.).


An application of the technology in this area includes the creation of mobile phone software
generation programs for educational games and reference software applications. Some 400
casino games, 200 bilingual dictionaries, and thousands of professional reference
applications have been authored and are currently selling via various distribution channels
(for PocketPC and Smartphones).





Web Site Creation

Output: World’s largest multilingual dictionary: Webster's Online Dictionary
(www.websters-online-dictionary.org).


Listed, for the year 1999, as an important “invention” of the 20th century by The Great Idea
Finder, Webster’s Online Dictionary - The Rosetta Edition is an open access dictionary that
spans over 400 languages. The dictionary is now the world’s largest and is a mix of compiled
and original content generated using the technology. Despite the dictionary being so large
(with over 20,000,000 entries, and growing), it is maintained by no editorial, marketing or
other staff. Well over 40% of the content, statistics, and entries were authored by
computer, in the same manner that a lexicographer or linguist would. The dictionary is
constantly being improved and is a laboratory for innovation. Currently the dictionary
receives some 1,000,000 page views a month, and is ranked higher, in terms of traffic, than
the Oxford English Dictionary. Over 1,000 sites link to the dictionary or its pages. Some 85
percent of the site’s traffic comes from outside of the United States, and it is, for many
languages, the primary site for language learning and reference. The dictionary is in its
“first draft” form. Reviews and historical discussions of the current edition can be found
here. Similarly, the approach can be adapted to create a high volume of content-oriented
sites that span languages or topics, for use over traditional or mobile networks, that
themselves become authors of original content, with or without end-user interaction.





Video (All Formats & Media)

Output
: Various high volume programs.


The cost of professional video production involves a large quantity of human inputs, from
producers, scriptwriters, actors, and directors, to set designers, photographers, camera
crews, special effects specialists, and pre- and post-production editors. Human and material
costs have often prevented the creation of niche programming or films on narrow topics, or
for languages or cultures that might not have a large enough audience to profitably justify
an investment. This has led to content shortages for many countries, languages, interest
groups or cliques (micro-segments).


The substantial costs of production have also led a number of media companies to rely on
user-generated or contributed content of variable quality and/or that will fail to meet the
needs of these unserved niches (e.g. there are not enough video producers interested in,
say, Tarahumara to justify creating enough content to support a channel for that audience).


Automated video authoring is similar in nature to that of books or software, though the
formats have higher dimensionality and the "intelligences" modeled are different. The goal
is to drive the cost of high-quality video production to a minimal marginal cost (e.g. the
cost of rendering alone).


The technology is now being used for video production for a variety of the more formulaic
genres (news, game shows, education, mobile phone snacks, classic storytelling, comedy,
etc.). Examples of test renders for mobile telephone snacks and television segments can be
found here on YouTube:




Mobile/Traditional Snacks

Word of the Day “Snack”: Macroglossia (thousands of these across languages are in production).
Word of the Day “Snack”: Hindsight
Word of the Day “Snack”: Euphonious
Word of the Day “Snack”: Laconic
Word of the Day “Snack”: Excretion

Gameshow

A Multilingual Gameshow (cut scenes only, created for all written languages, for people
wanting to learn English).

Advertising/Promotion

A Video Promotion Clip for a Hangman Game (also authored via computer).

Segues

A Classic Movie Review before its Airing
A DVD Introduction Segment








The Future





As the above cases illustrate, the application of the technology is format and context
independent. Only a small percentage of ideas are represented here. Future applications, in
the works, include fully interactive, real-time authoring systems and other activities that
fully integrate human activities, allowing third parties, and also end-users, to have their
systems create original title materials.




Glossary of Important Terms and Concepts





The following glossary can prove useful to our partners in approaching automated content
creation. We have sorted these definitions in a logical order of “conception” to “delivery”:




Method and apparatus for automated authoring and marketing: an approach for automatic
authoring, marketing, and/or distributing of title material. A computer automatically
authors material. The material is automatically formatted into a desired format, resulting
in a title material. The title material may also be automatically distributed to a
recipient. Meta material, marketing material, and control material are automatically
authored and, if desired, distributed to a recipient. Further, the title material may be
authored on demand, such that it may be in any desired language and with the latest version
and content.




Original work of authorship: Works of authorship include title materials, such as literary
works; musical works, including the lyrics; dramatic works, including any accompanying
music; pictorial or graphical works; motion pictures and other audiovisual works; sound
recordings; and any compilations and/or derivative works or the work of authorship; and
other materials.




Materials: any information and data capable of being used in a title material, for example
text, audio, video, descriptive, tabular, artistic, and/or graphical information.




Title material: publishable and/or authored work, such as literary works, serial
publications, theatrical plays, books, including fiction and nonfiction works (for example,
but not limited to, reference books, market research reports, travel guides, company
competitive analyses, industry reports, company reports, management consulting reports,
technical documents, and the like), newsletters, magazines, computer instructions, software,
software publications, Internet publications, computer-based content, Internet web sites,
musical scores, screen plays, video productions, holographic or 3-D works, virtual reality
works, and the like. Alternatively, title material includes any work that is capable of
being associated with a unique identification alpha-numeric code, for example a unique
alpha-numeric identifier that is used to identify the work or a catalog number. Title
material also includes any work that is capable of being associated with a unique
alpha-numeric code, such as an ISBN (International Standard Book Number), ISSN
(International Standard Serial Number), a UPC (Universal Product Code), a library number
(such as the Library of Congress identifier), a bar code, an item number, an SKU (Stock
Keeping Unit), a number code, a case law number, a docket number, an abstract number, a year
of publication, a chapter code, and the like. Title material can also include any authored
or published work that is to be commercially available. Title materials can include any work
with an alpha-numeric numbering system that is observable or intended to be observable
within the public domain.




Marketing material: includes information used to market, disseminate knowledge of, or
promote title material. Marketing materials publicize or announce title materials to various
audiences, including remote servers that post electronic announcements. Marketing material
includes public relations works, press releases, product announcements, brochures, flyers,
billboards or outdoor copy, video, audio, magazine or print media copy, emails, banners,
displays or similar materials, etc.




Meta material: includes materials used to describe title material. Meta materials may be
used in the publishing and media industries to catalogue and/or promote title material. Meta
materials describe title material to publishers, resellers, distributors, industry
associations, industry organizations, government organizations, or end-users such as
libraries or individuals. Further, meta materials may include text, graphics, numerical
data, coverings (such as a book jacket, a CD jacket, videotape jacket, or the like) or other
information that is used to describe the title material. Additionally, meta material may
include, but is not limited to, information regarding the price of the title material, the
length in pages or time of the title material, the language of the title material, the
physical or electronic format of the title material, the binding or packaging of the title
material, an abstract of the title material's content, an alpha-numeric identification
number of the title material, subject codes or text of the title material, comments from the
author of the title material, comments from the publisher of the title material, credits
related to the title material, endorsements of the title material, reviews of the title
material, a table of contents of the title material, date of publication of the title
material, place of publication of the title material, name of the publisher or producer of
the title material, address of the publisher or producer of the title material, or the like.
Further, meta material includes meta files and/or metadata.




Control materials: include any information used to control, track, index or account for
title material. Control materials include items in meta, title or marketing materials, but
may also include information used for inventory control, billing, financial accounting,
stock keeping, information relating to the target audience, and cataloguing information used
for internal control.




Database files: include modules, queries, macros, reports, tables, templates, graphics,
automation programs, audio and video files, data files, material files, information in a
database, document files, and the like.




Genre: A genre is a group or series of title materials having common characteristics or
using similar procedures to be authored. Genres include, for example, a series of market
research reports having similar formats, logical statements, calculations, graphics, or
patterns with different content for each title material within the genre. A genre of
materials may include multiple materials having similar characteristics.




Recipient: A recipient is any individual, entity, computer, or the like, that is capable of
receiving title, meta, marketing, and/or control materials authored by the present
invention. For example, a recipient may include a distributor or an end-user of the title
material.




User: A user includes any individual, entity, computer, or the like, that is using the
system of the present invention to automatically author, distribute, and/or market title
materials.




End-user: An end-user includes any individual, entity, computer, or the like, that is to be
the ultimate consumer of the title material.




System of networked computers: any system of multiple computers that are directly or
indirectly interconnected by any types of electronic connections, including connections via
hardwire, Ethernet, token ring, modem, digital subscriber line, cable modem, wireless,
radio, satellite, and combinations thereof. Such connections may be implemented using copper
wire, fiber optics, radio waves, coherent light, or other media. The system of networked
computers may be the Internet, an intranet, a secure virtual private network (VPN), or any
other system of computers that are interconnected by electronic connections. As used herein,
the term "network" refers to any such system of networked computers, including the Internet.
Likewise, as used herein, the expression "providing a system of networked computers" means
creating a network specifically for the purpose of facilitating the present invention or
simply connecting to an existing network for the purpose of facilitating the present
invention.




Computer: any general-purpose machine that processes data according to a set of instructions
that is stored internally either temporarily or permanently, including, but not limited to,
a general purpose computer, workstation, laptop computer, personal computer, set top box,
web access device (such as WEB TV™ (Microsoft Corporation)), cable television, satellite
television, broadband network, an electronic viewing or listening device, any other type of
computer, wireless devices, such as a personal digital assistant (PDA), cellular or mobile
telephones, electronic handheld units for the wireless receipt and/or transmission of data,
such as a BlackBerry® (Research In Motion Limited), or the like.
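
The glossary terms relate to one another roughly as a data model: a genre groups title
materials, and each title material carries meta, marketing, and control materials that can
be sent to a recipient. The sketch below expresses that reading in code. Every class and
field name is a hypothetical choice made for illustration, and the identifier shown is a
placeholder rather than a real ISBN or SKU.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TitleMaterial:
    """A publishable work plus the materials authored alongside it (hypothetical layout)."""
    identifier: str  # e.g. an ISBN, SKU or catalog number
    language: str
    body: str
    meta: Dict[str, str] = field(default_factory=dict)     # price, length, abstract, ...
    marketing: List[str] = field(default_factory=list)     # press releases, announcements, ...
    control: Dict[str, str] = field(default_factory=dict)  # inventory, billing, cataloguing


@dataclass
class Genre:
    """A series of title materials authored by similar procedures."""
    name: str
    titles: List[TitleMaterial] = field(default_factory=list)

    def add(self, title: TitleMaterial) -> None:
        self.titles.append(title)


@dataclass
class Recipient:
    """Any individual, entity or computer that can receive materials."""
    name: str
    received: List[str] = field(default_factory=list)

    def receive(self, title: TitleMaterial) -> None:
        # Record only the identifier of the delivered title.
        self.received.append(title.identifier)


genre = Genre("bilingual crossword puzzles")
title = TitleMaterial(identifier="PLACEHOLDER-0001", language="fr", body="...",
                      meta={"pages": "100"})
genre.add(title)
distributor = Recipient("example distributor")
distributor.receive(title)
print(len(genre.titles), distributor.received)
```

Keeping meta, marketing, and control materials as fields of the title mirrors the glossary's
point that they are authored together with the title rather than added afterwards.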



Learning More



If you would like to organize a seminar for your company on this topic, please contact
INSEAD’s Executive Education department.