A Life in Research

 

 

APPENDIX C: THE NATURE OF COMPUTER-SCIENCE RESEARCH

 

There is a fundamental difference in Particle Physics (PP) research in which I was earlier engaged in, and Computer Science (CS) research in which I entered later. In PP the objective is to discover the natural properties, often expressed in complex and abstract mathematical models, which can take long computer time in evaluation.   The length of computation time does not bother a physicist as long as the model is correct. In contrast, a quick execution is an essential requirement of a computational model, even if it means cutting some corners.

 

Computer Science is a technological science, which can be defined as:

    A good (i.e. sound) computational model for the problem domain.

 

There may be many computational models (unlike the single reality of fundamental physics) to solve a problem, where the researcher is looking for the best one, but unlike physics, the best one will not be necessarily acceptable if it is too complicated and computationally too expensive. For example, it is possible to make a generalised CS model for a problem domain, but a generalised model could be computationally too expensive, and hence would not be used.  In contrast, a generalised model is the one that a physicist will generally prefer, as it will be assumed to reveal the deeper truth – here the computational cost is irrelevant. Consider the experiments in CERN for finding Higgs bosons. The researchers there do not talk about the computation times of various modelling options, what they talk about is the result, i.e. if a signature for Higgs bosons has been found. I am not saying that researchers do not care about efficient computation, but it is far lower down in the food chain.

 

Here is a good experience that I had on this in 1982, when I produced a general architecture for distributed databases (DDBs), which was well-received by the CS researchers. The idea was that a diverse range of databases, of different data models (i.e. different data-structures and query languages), including pre-existing databases (each such database situated perhaps at a different network node), should be able to join a distributed database system, through which a user should be able (i) to treat the underlying databases as “a single database”, (ii) to query that “single database” by a global language and (iii) to receive integrated answers.  Such a facility requires all these underlying databases to support inter-database conversions and interpretations (both semantic and syntactic) of data-structures and query languages, transparent to the global users. The necessary conversions, which are complex and time-consuming, have to be carried out dynamically as global queries are presented to the DDB system for execution. Contrast this position with Google search, which simply lists articles from different sources for the inquirer to view and read, nothing more. Its chief virtue is its execution speed. There is no attempt in Google to combine these answers (articles) into a single integrated answer (article), and hence Google is not a distributed database (DDB) facility.  A DDB is far more ambitious, and its answers from different remote sources include both data and text, which involve as mentioned above, both semantic and syntactic interpretations for the production of a single integrated global answer, and hence its complexity.

 

The IBM ignored such a genuine distributed database facility, and instead extended the software of their existing SQL database system, allowing the data of a single database to be partitioned and held over multiple computers connected by a communications network. They called it a distributed database, which made us very cross – a betrayal of the concept of true distributed databases, we thought. We had then a DDB conference that year in Berlin, which I attended. I was riding high, because of my DDB architecture.   An IBMer presented their DDB architecture, but he was attacked by everyone, so much so that he escaped from the rest of the conference.  But today it is the IBM’s version of distributed database that is used, nobody would even implement our model, which they would consider impracticable. However, I still believe that sometime in the future, people might implement a distributed database facility of my kind for at least in-house management reports from diverse sources. 

 

Given the definition of CS as a good (i.e. sound) computational model for the problem domain, how do we go about doing CS research? I present the following approach that guided us, but using example from the development of the Relational Database Model for clarity.

Observe that in Computer Science, invention is said to be harder than discovery, since invention has to be based on the solid foundation of experimentation. Invention, I presume, is our aim. Borrowing some ideas from Professor Milner of Edinburgh, and with some modifications, I produced my version of an approach to invention research:

 

  1. Examine a problem domain, and abstract the requirements. Sometimes several problem domains can lead to the same abstraction and also some aspects of the problem domain could be too different (or too complex) to be captured by a single elegant abstraction theme, that is, the abstracted domain may not model the whole of problem neatly, but only part of it. This is okay.
  2. Produce a good solution (a sound computational model) for the abstracted domain.
  3. Demonstrate that the model works not only for that problem domain but also for similar other problem domains.
  4. The wider the range of problem domains over which this model works, the better is the model.

 

Consider the Relational Model as an example. Dr E. F. Codd [Ted Codd] was trying to represent data in a simple way that could be queried also in a simple way. He thought of tabular structures, but he also understood that all data may not be amenable to a tabular form. For example, repeating groups (e.g. number of windows in a house, or the number of months in a year) and hierarchical structures (such as a name made up of surname and initials) require more complex data structures than simple tables. Maps and engineering drawings are further examples of complex structures that cannot be easily represented by simple tabular data.

 

He accepted these limitations and developed a query language with five basic operations for handling tabular data. (i) how to extract one or more columns from a table, (ii) how to extract one or more rows from a table,  (iii) how to insert rows of one table into those of another, (iv) how to remove rows from a table, and (v) how to insert column(s) of one table as additional column(s) of another table.

 

He defined these operations mathematically (i.e. with precision) and proved mathematically that these five operations are necessary and sufficient for all operations over tabular data structures (i.e. in the relation model), except for some statistical operations, such as average, minimum and maximum values, etc.  Observe that any single value can be thought of as the value of a table of one column and one row. This is the power of the model where all operations over one or more tables produce another table as the result, successively. Needless to say that the success of this model has been so staggering that people have now extended it even for some non-tabular data-structures.

 

The criteria we used to judge the success of a model is: does it make SENSE? The SENSE criteria are:

 

S Soundness: that is the model is sound.

E Employability: that is, the model can be employed directly on a problem, without any further and subjective assumptions. Many models fail on this, when the provisions of a model require further, often far-away, subjective interpretations before the model can be applied to solve a problem. Some PhD theses (models) fail utterly on this employability criteria.

N Necessity: that is, the assumptions made are as few as possible i.e. MINIMALLY necessary assumptions. It implies that all assumptions must be orthogonal with respect to one another, without any overlapping assumptions.

S Sufficiency: that is, the concepts used are sufficient – no further concepts are needed to solve the problems in that domain.

E Efficiency and effectiveness: that is, the solution is efficient and effective.

 

Everything should be made as simple as possible – but not simpler – A. Einstein.

Appendicies