A Life in Research
APPENDIX C: THE NATURE OF COMPUTER-SCIENCE RESEARCH
There is a fundamental difference between Particle Physics
(PP) research, in which I was engaged earlier, and Computer
Science (CS) research, which I entered later. In PP the
objective is to discover natural properties, often expressed
in complex and abstract mathematical models, whose
evaluation can take long computer time.
The length of
computation time does not bother a physicist as long as the
model is correct. In contrast, quick execution is an
essential requirement of a computational model, even if it
means cutting some corners.
Computer Science is a technological science, whose goal
can be defined as:
A good (i.e. sound) computational model for the
problem domain.
There may be many computational models to solve a
problem (unlike the single reality of fundamental physics),
and the researcher looks for the best one; but unlike in
physics, the best one will not necessarily be acceptable if
it is too complicated and computationally too expensive. For
example, it is possible to make a generalised CS model for a
problem domain, but a generalised model could be
computationally too expensive, and hence would not be used.
In contrast, a generalised model is the one that a
physicist will generally prefer, as it will be assumed to
reveal the deeper truth – here the computational cost is
irrelevant. Consider the experiments at CERN for finding
the Higgs boson. The researchers there do not talk about the
computation times of various modelling options; what they
talk about is the result, i.e. whether a signature of the
Higgs boson has been found. I am not saying that researchers do
not care about efficient computation, but it is far lower
down in the food chain.
Here is an instructive experience I had on this in 1982,
when I produced a general architecture for distributed
databases (DDBs), which was well received by CS
researchers. The idea was that a diverse range of databases,
of different data models (i.e. different data-structures and
query languages), including pre-existing databases (each
such database situated perhaps at a different network node),
should be able to join a distributed database system,
through which a user should be able (i) to treat the
underlying databases as “a single database”, (ii) to query
that “single database” by a global language and (iii) to
receive integrated answers.
Such a facility requires all these underlying
databases to support inter-database conversions and
interpretations (both semantic and syntactic) of
data-structures and query languages, transparent to the
global users. The necessary conversions, which are complex
and time-consuming, have to be carried out dynamically as
global queries are presented to the DDB system for
execution. Contrast this position with Google search, which
simply lists articles from different sources for the
inquirer to view and read, nothing more. Its chief virtue is
its execution speed. There is no attempt in Google to
combine these answers (articles) into a single integrated
answer (article), and hence Google is not a distributed
database (DDB) facility.
A DDB is far more ambitious: its answers from
different remote sources include both data and text, which,
as mentioned above, involve both semantic and syntactic
interpretation to produce a single integrated
global answer; hence its complexity.
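The mediator-and-wrapper idea behind such a DDB architecture can be sketched roughly as follows. This is a minimal illustration, not the actual 1982 architecture; all class names, field names, and the toy data are hypothetical. Each underlying database sits behind a wrapper that translates between its local schema and a common global schema, so that the user sees "a single database" and receives an integrated answer:

```python
# Hypothetical sketch of a DDB mediator: wrappers hide local schema
# differences; the mediator integrates answers into one global result.

class Wrapper:
    """Adapts one local database to the global schema."""
    def __init__(self, name, rows, field_map):
        self.name = name
        self.rows = rows              # stand-in for a real local database
        self.field_map = field_map    # local field name -> global field name

    def query(self, wanted_fields):
        # Syntactic interpretation: rename local fields to global ones.
        out = []
        for row in self.rows:
            g = {self.field_map[k]: v
                 for k, v in row.items() if k in self.field_map}
            out.append({f: g[f] for f in wanted_fields if f in g})
        return out

class Mediator:
    """Presents many wrapped databases as a single database."""
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def global_query(self, wanted_fields):
        # Integrate the answers from all sources into one result.
        result = []
        for w in self.wrappers:
            result.extend(w.query(wanted_fields))
        return result

# Two pre-existing databases with different local schemas:
staff_db = Wrapper("staff", [{"emp": "Ann", "dept": "Sales"}],
                   {"emp": "name", "dept": "department"})
hr_db = Wrapper("hr", [{"full_name": "Bob", "unit": "R&D"}],
                {"full_name": "name", "unit": "department"})

ddb = Mediator([staff_db, hr_db])
print(ddb.global_query(["name", "department"]))
# [{'name': 'Ann', 'department': 'Sales'}, {'name': 'Bob', 'department': 'R&D'}]
```

In a real DDB the translation step would also perform semantic interpretation (units, codes, naming conventions), which is what makes the dynamic conversions complex and time-consuming.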
IBM ignored such a genuine distributed database
facility, and instead extended the software of their
existing SQL database system, allowing the data of a single
database to be partitioned and held over multiple computers
connected by a communications network. They called it a
distributed database, which made us very cross: a betrayal
of the concept of true distributed databases, we thought.
There was a DDB conference in Berlin that year, which I
attended. I was riding high because of my DDB architecture.
An IBMer presented their DDB architecture, but he was
attacked by everyone, so much so that he escaped from the
rest of the conference. But
today it is IBM's version of distributed database that
is used; nobody would even implement our model, which they
would consider impracticable. However, I still believe that
sometime in the future, people might implement a distributed
database facility of my kind, at least for in-house
management reports from diverse sources.
Given the definition of CS as a good (i.e. sound)
computational model for the problem domain, how do we go
about doing CS research? I present the following approach
that guided us, using an example from the development of
the Relational Database Model for clarity.
Observe that in Computer Science, invention is said
to be harder than discovery, since invention has to be based
on the solid foundation of experimentation. Invention, I
presume, is our aim. Borrowing some ideas from Professor
Milner of Edinburgh, and with some modifications, I produced
my version of an approach to invention research:
- Examine a problem domain, and abstract the requirements.
Sometimes several problem domains can lead to the same
abstraction; also, some aspects of the problem domain could
be too different (or too complex) to be captured by a single
elegant abstraction theme. That is, the abstracted domain
may not model the whole of the problem neatly, but only part
of it. This is okay.
- Produce a good solution (a sound computational model) for
the abstracted domain.
- Demonstrate that the model works not only for that problem
domain but also for other similar problem domains.
- The wider the range of problem domains over which this
model works, the better the model.
Consider the Relational Model as an example. Dr E. F.
Codd [Ted Codd] was trying to represent data in a simple way
that could be queried also in a simple way. He thought of
tabular structures, but he also understood that all data may
not be amenable to a tabular form. For example, repeating
groups (e.g. number of windows in a house, or the number of
months in a year) and hierarchical structures (such as a
name made up of surname and initials) require more complex
data structures than simple tables. Maps and engineering
drawings are further examples of complex structures that
cannot be easily represented by simple tabular data.
He accepted these limitations and developed a query
language with five basic operations for handling tabular
data: (i) how to extract one or more columns from a table,
(ii) how to extract one or more rows from a table,
(iii) how to insert the rows of one table into those of
another, (iv) how to remove rows from a table, and (v) how
to insert column(s) of one table as additional column(s) of
another table.
He defined these operations mathematically (i.e. with
precision) and proved mathematically that these five
operations are necessary and sufficient for all operations
over tabular data structures (i.e. in the relational model),
except for some statistical operations, such as average,
minimum and maximum values, etc.
Observe that any single value can be thought of as
the value of a table of one column and one row. This is the
power of the model where all operations over one or more
tables produce another table as the result, successively.
Needless to say, the success of this model has been so
staggering that people have now extended it even to some
non-tabular data-structures.
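The five operations described above roughly correspond to what relational algebra now calls projection, selection, union, difference, and join. A minimal sketch, representing a table as a list of dicts (a simplification of Codd's formal model, with the toy data invented for illustration), also shows the closure property: every operation yields another table, so operations compose freely:

```python
# The five basic operations over tables represented as lists of dicts.
# The operation names are the now-standard relational-algebra ones,
# mapped informally onto the author's (i)-(v).

def project(table, cols):            # (i) extract columns
    return [{c: row[c] for c in cols} for row in table]

def select(table, pred):             # (ii) extract rows
    return [row for row in table if pred(row)]

def union(t1, t2):                   # (iii) insert rows of one table into another
    return t1 + [r for r in t2 if r not in t1]

def difference(t1, t2):              # (iv) remove rows from a table
    return [r for r in t1 if r not in t2]

def join(t1, t2, key):               # (v) add columns of one table to another
    return [{**r1, **r2} for r1 in t1 for r2 in t2 if r1[key] == r2[key]]

emp = [{"name": "Ann", "dept": "Sales"}, {"name": "Bob", "dept": "R&D"}]
loc = [{"dept": "Sales", "city": "Leeds"}, {"dept": "R&D", "city": "York"}]

# Closure: each operation returns a table, so the results compose.
result = project(select(join(emp, loc, "dept"),
                        lambda r: r["city"] == "Leeds"),
                 ["name"])
print(result)   # [{'name': 'Ann'}]
```

A single value, as the text notes, is just the degenerate case of a one-column, one-row table, so even scalar answers stay inside the model.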
The criterion we used to judge the success of a model
is: does it make SENSE? The SENSE criteria are:
S - Soundness: that is, the model is sound.
E - Employability: that is, the model can be employed
directly on a problem, without any further subjective
assumptions. Many models fail on this, when the provisions
of a model require further, often far-away, subjective
interpretations before the model can be applied to solve a
problem. Some PhD theses (models) fail utterly on this
employability criterion.
N - Necessity: that is, the assumptions made are as few as
possible, i.e. MINIMALLY necessary assumptions. This implies
that all assumptions must be orthogonal with respect to one
another, without any overlapping assumptions.
S - Sufficiency: that is, the concepts used are sufficient;
no further concepts are needed to solve the problems in that
domain.
E - Efficiency and effectiveness: that is, the solution is
efficient and effective.
Everything should be made as simple as possible – but
not simpler – A. Einstein.