Introduction to Statistics; Collection of Data


What do we mean by “Statistics”?

“Statistics” is a word which is used in a variety of ways and with a variety of meanings, but, in whatever way it is used, it is always concerned with numerical information. There are two particular meanings of the word which concern us, namely:

  • The numerical facts themselves: for example, we talk of the “statistics” of steel production.
  • The methods of analysing the facts: in this sense, “Statistics” is the title of a subject like “Arithmetic” or “Chemistry” or “Physics”; sometimes the subject is called “Statistical Method”.


As a subject, Statistics is a branch of science which deals with facts and figures; if you have a lot of numerical information about any topic, then statistical methods help you to extract the most value from it. Like mathematics, statistics is quite general; it does not matter what the figures are about, the methods still apply. Whether you are a businessman, a scientist or an accountant, the methods of analysing your facts and figures are very similar.

One of the main features of Statistical Methods is that it deals with things in groups rather than with individuals. In comparing, say, the height of Frenchmen with the height of Englishmen, we are concerned with Frenchmen in general and Englishmen in general, but not with Marcel and John as individuals. An insurance company, to give another example, is interested in the proportion of men (or women) who die at certain ages, but it is not concerned with the age at which John Kimuda (or Mary Kimenyi), as individuals, will die.

Importance of Statistics in Business

Many people think of Statistics as part of Economics but, as we have already mentioned, the subject is much more general than that. It is true, however, that economic and business situations very often provide the kind of data which is best analysed by statistical methods and which, without such methods, is either meaningless or misleading. For this reason it is important that anyone engaged in business or industry should have some sound knowledge of Statistics. In this way he or she will be able to use the methods of Statistics to help make decisions and also, what may sometimes be more important, he or she will know how best to make use of the services of professional statisticians.

With these considerations in mind, most professional bodies concerned with business affairs include Statistics as a subject in their examinations.

Main Stages in a Statistical Investigation

First we must define the problem and decide exactly what it is we want to know or to predict. Collection of relevant data follows. The classification and analysis of this data and, finally, the presentation of the results complete the statistical investigation.

The Subject of Statistics

As you study the subject of Statistics, you should bear in mind the following points:


  • Statistical methods are not a “sausage machine” giving set answers to set questions. They are more like the tools in a tool chest, and for any particular job a good deal of thought and perhaps some trial and error may be needed before the correct tool is chosen and used.
  • In real life, statistical work often involves extensive calculations, but our purpose is to learn principles and methods rather than to do lots of arithmetic. Consequently, most of our examples will contain relatively few figures, but remember that in practice one usually (but not always) has to apply the methods to a much larger mass of data.

  • Some statistical methods are based on advanced mathematics, but do not be put off by that. For this course we can take the mathematics for granted or learn it as we go along, and we shall not require anything but ordinary arithmetic and some very simple algebra.




Even before the collection of data starts, there are some important points to consider when planning a statistical investigation. Shortly I will give you a list of these together with a few notes on each; some of them you may think obvious or trivial, but do not neglect to learn them because they are very often the points which are overlooked. Furthermore, examiners like to have lists as complete as possible when they ask for them!


What, then, are these preliminary matters?

Exact Definition of the Problem

This is necessary in order to ensure that nothing important is omitted from the enquiry, and that effort is not wasted by collecting irrelevant data. The problem as originally put to the statistician is often of a very general type and it needs to be specified precisely before work can begin.

Definition of the Units

The results must appear in comparable units for any analysis to be valid. If the analysis is going to involve comparisons, then the data must all be in the same units. It is no use just asking for “output” from several factories – some may give their answers in numbers of items, some in weight of items, some in number of inspected batches and so on.

Scope of the Enquiry

No investigation should get under way until the field to be covered has been defined. Are we interested in all departments of our business, or only some? Are we to concern ourselves with our own business only, or with others of the same kind?

Accuracy of the Data

To what degree of accuracy is data to be recorded? For example, are ages of individuals to be given to the nearest year or to the nearest month or as the number of completed years? If some of the data is to come from measurements, then the accuracy of the measuring instrument will determine the accuracy of the results. The degree of precision required in an estimate might affect the amount of data we need to collect. In general, the more precisely we wish to estimate a value, the more readings we need to take.
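
The link between precision and the number of readings can be sketched numerically. The example below is only an illustration: it assumes a simple random sample, a known standard deviation and a normal approximation, none of which is stated in the text above, and the function name and figures are hypothetical.

```python
import math

def readings_needed(std_dev, margin, z=1.96):
    """Approximate number of readings needed for the sample mean to lie
    within `margin` of the true mean.

    Assumptions (not from the text): simple random sampling, a known
    standard deviation, and z = 1.96 for roughly 95% confidence.
    """
    return math.ceil((z * std_dev / margin) ** 2)

# Halving the margin of error roughly quadruples the readings required.
n_wide = readings_needed(std_dev=10.0, margin=2.0)
n_narrow = readings_needed(std_dev=10.0, margin=1.0)
print(n_wide, n_narrow)
```

Under these assumptions the required number of readings grows with the square of the precision demanded, which is why a modest tightening of the target can make an enquiry much more expensive.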


Primary and Secondary Data

In its strictest sense, primary data is data which is both original and has been obtained in order to solve the specific problem in hand. Primary data is therefore raw data and has to be classified and processed using appropriate statistical methods in order to reach a solution to the problem.

Secondary data is any data other than primary data. Thus it includes any data which has been subject to the processes of classification or tabulation or which has resulted from the application of statistical methods to primary data, and all published statistics.

Quantitative/Qualitative Categorisation

Variables may be either quantitative or qualitative. Quantitative variables, to which we shall restrict discussion here, are those for which observations are numerical in nature. Qualitative variables have non-numeric observations, such as colour of hair, although, of course, each possible non-numeric value may be associated with a numeric frequency.
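
As a small illustration of how a numeric frequency attaches to each value of a qualitative variable, the sketch below counts some made-up hair-colour observations. The data are hypothetical and serve only to show the idea.

```python
from collections import Counter

# Hypothetical observations of a qualitative variable (hair colour).
hair_colour = ["brown", "black", "brown", "fair", "black", "brown"]

# The observations themselves are non-numeric, but each distinct value
# can be associated with a numeric frequency.
frequencies = Counter(hair_colour)

for colour, count in frequencies.most_common():
    print(colour, count)
```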

Continuous/Discrete Categorisation

Variables may be either continuous or discrete. A continuous variable may take any value between two stated limits (which may possibly be minus and plus infinity). Height, for example, is a continuous variable, because a person’s height may (with appropriately accurate equipment) be measured to any minute fraction of a millimetre. A discrete variable, however, can take only certain values occurring at intervals between stated limits. For most (but not all) discrete variables, these interval values are the set of integers (whole numbers).

For example, if the variable is the number of children per family, then the only possible values are 0, 1, 2, … etc. because it is impossible to have other than a whole number of children. However, in Ireland, shoe sizes are stated in half-units, and so here we have an example of a discrete variable which can take the values 1, 1½, 2, 2½, etc.
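
The idea of a discrete variable whose permitted values are not whole numbers can be sketched as a simple validity check. The function name, the smallest size and the step are assumptions made for illustration; only the half-unit spacing comes from the shoe-size example.

```python
from fractions import Fraction

def is_valid_shoe_size(size, smallest=1, step=Fraction(1, 2)):
    """Return True if `size` lies on the half-unit grid 1, 1.5, 2, 2.5, ...

    `smallest` and `step` are illustrative assumptions. Fractions are
    used so that the arithmetic is exact.
    """
    size = Fraction(size)
    return size >= smallest and (size - smallest) % step == 0

print(is_valid_shoe_size("2.5"))  # a permitted value of the discrete variable
print(is_valid_shoe_size("2.3"))  # falls between two permitted values
```

A continuous variable such as height would need no check of this kind, since any value between the stated limits is possible.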



Having decided upon the preliminary matters about the investigation, the statistician must look in more detail at the actual data to be collected. The desirable qualities of statistical data are the following:


  • Homogeneity
  • Completeness
  • Accurate definition



Homogeneity

The data must be in properly comparable units. “Five houses” means little, since five dwelling houses are very different from five ancestral castles. Houses cannot be compared unless they are of a similar size or value. If the data is found not to be homogeneous, there are two possible methods of adjustment.

  • Break down the group into smaller component groups which are homogeneous and study them separately.
  • Standardise the data. Use units such as “output per man-hour” to compare the output of two factories of very different size. Alternatively, determine a relationship between the different units so that all may be expressed in terms of one; in food consumption surveys, for example, a child may be considered equal to half an adult.
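
The two ways of standardising mentioned above can be sketched with a little arithmetic. The factory and household figures are invented for illustration, and the function names are hypothetical; the half-adult weighting for a child is the one given in the text.

```python
def output_per_man_hour(total_output, man_hours):
    """Express output in a unit that is comparable across factories of different size."""
    return total_output / man_hours

def adult_equivalents(adults, children, child_weight=0.5):
    """Express a household in one unit, counting a child as half an adult."""
    return adults + child_weight * children

# Hypothetical factories: raw output differs greatly, but the
# standardised figures can be compared directly.
large_factory = output_per_man_hour(total_output=12000, man_hours=4000)
small_factory = output_per_man_hour(total_output=1500, man_hours=400)

# Hypothetical household of two adults and three children.
household = adult_equivalents(adults=2, children=3)

print(large_factory, small_factory, household)
```

In this made-up case the smaller factory turns out to be the more productive one per man-hour, which the raw output totals would have hidden.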




Completeness

Great care must be taken to ensure that no important aspect is omitted from the enquiry.

Accurate Definition

Each term used in an investigation must be carefully defined; it is so easy to be slack about this and to run into trouble. For example, the term “accident” may mean quite different things to the injured party, the police and the insurance company! Watch out also, when using other people’s statistics, for changes in definition. Laws may, for example, alter the definition of an “indictable offence” or of an “unemployed person”.


The circumstances of the data must remain the same throughout the whole investigation. It is no use, for example, comparing the average age of workers in an industry at two different times if the age structure has changed markedly. Likewise, it is not much use comparing a firm’s profits at two different times if the working capital has changed.


When all the foregoing matters have been dealt with, we come to the question of how to collect the data we require. The methods usually available are as follows:


  • Use of published statistics
  • Personal investigation/interview
  • Delegated personal investigation/interview

Published Statistics

Sometimes we may be attempting to solve a problem that does not require us to collect new information, but only to reassemble and reanalyse data which has already been collected by someone else for some other purpose.

We can often make good use of the great amount of statistical data published by governments, the United Nations, nationalised industries, chambers of trade and commerce and so on. When using this method, it is particularly important to be clear on the definition of terms and units and on the accuracy of the data. The source must be reliable and the information up-to-date.

This type of data is sometimes referred to as secondary data in that the investigator himself has not been responsible for collecting it and it thus came to him “second-hand”. By contrast, data which has been collected by the investigator for the particular survey in hand is called primary data.

The information you require may not be found in one source but parts may appear in several different sources. Although the search through these may be time-consuming, it can lead to data being obtained relatively cheaply and this is one of the advantages of this type of data collection. Of course, the disadvantage is that you could spend a considerable amount of time looking for information which may not be available.

Another disadvantage of using data from published sources is that the definitions used for variables and units may not be the same as those you wish to use. It is sometimes difficult to establish the definitions from published information, but, before using the data, you must establish what it represents.


Personal Investigation/Interview

In this method the investigator collects the data himself. The field he can cover is, naturally, limited. The method has the advantage that the data will be collected in a uniform manner and with the subsequent analysis in mind. There is sometimes a danger to be guarded against though, namely that the investigator may be tempted to select data that accords with some of his preconceived notions.

The personal investigation method is also useful if a pilot survey is carried out prior to the main survey, as personal investigation will reveal the problems that are likely to occur.

Delegated Personal Investigation/Interview

When the field to be covered is extensive, the task of collecting information may be too great for one person. Then a team of selected and trained investigators or interviewers may be used. The people employed should be properly trained and informed of the purposes of the investigation; their instructions must be very carefully prepared to ensure that the results are in accordance with the “requirements” described in the previous section of this study unit. If there are many investigators, personal biases may tend to cancel out.

Care in allocating the duties to the investigators can reduce the risks of bias. For example, if you are investigating the public attitude to a new drug in two towns, do not put investigator A to explore town X and investigator B to explore town Y, because any difference that is revealed might be due to the towns being different, or it might be due to different personal biases on the part of the two investigators. In such a case, you would try to get both people to do part of each town.
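
The allocation described above, in which every investigator covers part of every town, can be sketched as a simple crossed design. The function name is hypothetical, and the labels A, B, X and Y are those used in the example.

```python
from itertools import product

def crossed_allocation(investigators, towns):
    """Pair every investigator with every town.

    Because each investigator works in each town, a difference between
    towns cannot be explained simply by who carried out the interviews.
    """
    return list(product(investigators, towns))

assignments = crossed_allocation(["A", "B"], ["X", "Y"])
for investigator, town in assignments:
    print(f"Investigator {investigator} covers part of town {town}")
```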



In some enquiries the data consists of information which must be supplied by a large number of people. Then a very convenient way to collect the data is to issue questionnaire forms to the people concerned and ask them to fill in the answers to a set of printed questions. This method is usually cheaper than delegated personal investigation and can cover a wider field. A carefully thought-out questionnaire is often also used in the previous methods of investigation in order to reduce the effect of personal bias.

The distribution and collection of questionnaires by post suffers from two main drawbacks:


  • The forms are completed by people who may be unaware of some of the requirements and who may place different interpretations on the questions – even the most carefully worded ones!
  • There may be a large number of forms not returned, and these may be mainly from people who are not interested in the subject or who are hostile to the enquiry. The result is that we end up with completed forms only from a certain kind of person and thus have a biased sample.


It is essential to include a reply-paid envelope to encourage people to respond.


If the forms are distributed and collected by interviewers, a greater response is likely and queries can be answered. This is the method used, for example, in the Population Census. Care must be taken, however, that the interviewers do not lead respondents in any way.


Advantages of Interviewing

There are many advantages of using interviewers in order to collect information.

The major one is that a large amount of data can be collected relatively quickly and cheaply. If you have selected the respondents properly and trained the interviewers thoroughly, then there should be few problems with the collection of the data.

This method has the added advantage of being very versatile since a good interviewer can adapt the interview to the needs of the respondent. Similarly, if the answers given to the questions are not clear, then the interviewer can ask the respondent to elaborate on them. When this is necessary, the interviewer must be very careful not to lead the respondent into altering rather than clarifying the original answers. The technique for dealing with this problem must be tackled at the training stage.

This “face-to-face” technique will usually produce a high response rate. The response rate is determined by the proportion of interviews that are successful.
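
The response rate just described is a simple proportion, and can be sketched as follows. The function name and the interview counts are hypothetical.

```python
def response_rate(successful, attempted):
    """Proportion of attempted interviews that were successful."""
    if attempted <= 0:
        raise ValueError("at least one interview must have been attempted")
    return successful / attempted

# Hypothetical survey: 180 successful interviews out of 240 attempted.
rate = response_rate(successful=180, attempted=240)
print(f"Response rate: {rate:.0%}")
```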

Another advantage of this method of collecting data is that with a well-designed questionnaire it is possible to ask a large number of short questions of the respondent in one interview. This naturally means that the cost per question is lower than in any other method.


Disadvantages of Interviewing

Probably the biggest disadvantage of this method of collecting data is that the use of a large number of interviewers leads to a loss of direct control by the planners of the survey. Mistakes in selecting interviewers and any inadequacy of the training programme may not be recognised until the interpretative stage of the survey is reached. This highlights the need to train interviewers correctly. It is particularly important to ensure that all interviewers ask questions in a similar manner. Even with the best will in the world, it is possible that an inexperienced interviewer, just by changing the tone of his or her voice, may give a different emphasis to a question than was originally intended.

In spite of these difficulties, this method of data collection is widely used as questions can be answered cheaply and quickly and, given the correct approach, the technique can achieve high response rates.


                       DESIGNING THE QUESTIONNAIRE



A “questionnaire” can be defined as “a formulated series of questions, an interrogatory” and this is precisely what it is. For a statistical enquiry, the questionnaire consists of a sheet (or possibly sheets) of paper on which there is a list of questions the answers to which will form the data to be analysed. When we talk about the “questionnaire method” of collecting data, we usually have in mind that the questionnaires are sent out by post or are delivered at people’s homes or offices and left for them to complete. In fact, however, the method is very often used as a tool in the personal investigation methods already described.

The principles to be observed when designing a questionnaire are as follows:


  • Keep it as short as possible, consistent with getting the right results.
  • Explain the purpose of the investigation so as to encourage people to give the answers.
  • Individual questions should be as short and simple as possible.
  • If possible, only short and definite answers like “Yes”, “No”, or a number of some sort should be called for.
  • Questions should be capable of only one interpretation.
  • There should be a clear logic in the order in which the questions are asked.
  • There should be no leading questions which suggest the preferred answer.
  • The layout should allow easy transfer for computer input.
  • Where possible, use the “alternative answer” system in which the respondent has to choose between several specified answers.
  • The respondent should be assured that the answers will be treated confidentially and that the truth will not be used to his or her detriment.
  • No calculations should be required of the respondent.


The above principles should always be applied when designing a questionnaire and, in addition, you should understand them well enough to be able to remember them all if you are asked for them in an examination question. They are principles and not rigid rules – often one has to go against some of them in order to get the right information. Governments can often ignore these principles because they can make the completion of the questionnaire compulsory by law, but other investigators must follow the rules as far as practicable in order to make the questionnaire as easy to complete as possible – otherwise they will receive no replies.



Choosing between the various methods is difficult, as the type of information required will often determine the method of collection. If the data is easily obtained by automatic methods or can be observed by the human eye without a great deal of trouble, then the choice is easy. The problem comes when it is necessary to obtain information by questioning respondents. The best guide is to ask yourself whether the information you want requires an attitude or opinion or whether it can be acquired from short yes/no type or similar simple answers. If it is the former, then it is best to use an interviewer to get the information; if the latter type of data is required, then a postal questionnaire would be more useful.

Do not forget to check published sources first to see if the information can be found from data collected for another survey.

Another yardstick worth using is time. If the data must be collected quickly, then use an interviewer and a short simple questionnaire. However, if time is less important than cost, then use a postal questionnaire, since this method may take a long time to collect relatively limited data, but is cheap.

Sometimes a question in the examination paper is devoted to this subject. The tendency is for the question to state the type of information required and ask you to describe the appropriate method of data collection giving reasons for your choice.

More commonly, specific definitions and explanations of various terms, such as interviewer bias, are contained in multi-part questions.


The auditor relies very heavily on data collection to perform his or her job. It would clearly be extremely costly to do a complete (100%) check of all records relating to the particular financial period under review. Also, this is not really necessary, since the auditor does not wish to prove that the financial statements are exactly correct.

Years ago, judgmental (or non-statistical) sampling was used very widely in auditing. The sample size and composition were determined purely by the auditor, and a large proportion (20-25%) of the records was normally checked. For example, the auditor might do a complete check on March, August and November during the course of an audit for one particular year. However, as businesses have increased in both size and complexity, there has been an ever-increasing volume of relevant documentation, and this has led to a move towards statistical sampling.

Usefulness of Statistical Sampling

Statistical sampling is now used almost exclusively, and is superior because it allows the auditor to quantify the estimates and the risks involved in his or her checking. It is primarily useful where there is a large number of small items, e.g. physical check on stock items, payroll check, and petty cash vouchers. It is not useful where small numbers or unusual items are concerned, e.g. material items of capital expenditure, directors’ expenses and remuneration, and non-recurring items.

Methods Used

When you come to Study Unit 3, which outlines the various methods of taking samples, remember that all these methods are applicable to the auditor. The size of the sample has to be chosen carefully, and other factors need to be considered here. For example:

  • How good is the internal control?
  • How material is the area to be tested?
  • How much precision is required?
  • What is the inherent risk for this area? For example, petty cash is high risk, whereas fixed assets are low risk.

There are three basic methods for taking samples.


  • Acceptance sampling includes a pre-defined level of error. When the sample is selected, the results obtained are compared with this pre-defined level. The whole area is accepted or rejected on this basis.
  • Discovery sampling involves looking for particular items. For example, when testing the standard of internal control, discovery sampling could be used to look for items which do not conform.
  • Estimation sampling is the most widely used method. Here a sample is taken and the results are used to estimate the proportion or the amount prevalent in the whole population. This idea will be expanded in Study Unit 8.
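
Estimation sampling and acceptance sampling can be contrasted with a short sketch. This is a simplified illustration only: the function names, the sample figures, the tolerable error rate and the normal-approximation interval are all assumptions, and audit practice may use more refined sampling plans.

```python
import math

def estimate_error_rate(errors_found, sample_size, z=1.96):
    """Estimation sampling: estimate the population error rate from a sample.

    Returns the sample proportion with an approximate interval based on a
    normal approximation (z = 1.96 is an assumed confidence setting).
    """
    p = errors_found / sample_size
    std_error = math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - z * std_error), min(1.0, p + z * std_error)

def accept_population(errors_found, sample_size, tolerable_rate):
    """Acceptance sampling: accept the whole area only if the sample error
    rate does not exceed the pre-defined level of error."""
    return errors_found / sample_size <= tolerable_rate

# Hypothetical check of 200 petty cash vouchers in which 6 errors are found.
p, low, high = estimate_error_rate(errors_found=6, sample_size=200)
decision = accept_population(errors_found=6, sample_size=200, tolerable_rate=0.05)
print(f"Estimated error rate {p:.1%} (roughly {low:.1%} to {high:.1%}); accept: {decision}")
```

The estimate gives the auditor a figure with a quantified margin, whereas the acceptance test gives only a yes-or-no decision against the level fixed in advance.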



