DATA CAPTURE AND DATA PROCESSING – MASOMO MSINGI PUBLISHERS

Data Capture

Data capture refers to the Input of data, not as a direct result of data entry but as a result of performing a different but related activity eg Barcode reader equipped supermarket checkout counters, for example, capture inventory related data while recording a sale. Original data should be captured at source rather than use a processed or unstructured form of the data. The most appropriate data capture method to be applied depends on the nature of data to be captured and the application area.

Data capture has evolved significantly over the years from early versions that relied on simple character based OCR, to modern versions that incorporate Word recognition, zonal and document recognition as well as Artificial intelligence such as pattern recognition and machine learning to deliver the most accurate recognition for computer generated text.

With advances in cloud computing, AI and mobile technology, data capture has come such a long way in recent years and is required in some form in almost every digital process. The digital world can overlay and co-exist with the physical world and business operations to create new value and possibilities in personal lives as well as working lives.

METHODS OF DATA CAPTURE

Manual Keying

Manual keying or data entry is still relevant with certain types of unstructured data. Process Flows can provide manual keying, verification services and even hybrid automation as a managed service.

Manual keying of metadata from unstructured data is appropriate for data that is received in low volumes and results in low levels of recognition by intelligent data capture products.

Nearshore Keying

Nearshore keying is the same as manual keying but instead of the task being completed in-house it is delivered by a managed service or delivery centre. Nearshore keying of Metadata can be an appropriate option when there are highly variable documents being processed.

Artificial Intelligence

Artificial Intelligence is ultimately an umbrella term for different artificial intelligence techniques. AI is best viewed in context of the use, case and application. All of the methods described below can be augmented to some degree or another by Artificial intelligence such as:

Computer vision, Image or pattern recognition to improve the recognition of any type of image.
Neural Networks & Machine learning to assist with accurate recognition training based on large data sets and assisted learning.
Natural Language Processing for interpreting sentences and their meaning.

Intelligent Voice Capture

The boom in smart devices has also seen the rise of voice controlled virtual assistants from the likes of Apple (Siri), Google (Google Assistant), Amazon (Alexa) and Microsoft (Cortana). These are the best examples of voice capture being used mainstream in our everyday lives. There are now many applications of Voice data capture in businesses. For example, applications such as CX-E (CallXpress) and virtual assistants provide the ability to capture voice commands and initiate business processes, transcribe voice mails and other functions unifying verbal communications data with other channels. Contact centres are a good example of where the unification and integration of voice data alongside voice, instant messaging, email, fax and forms deliver enhanced customer experiences and business processes.

Source Data Entry Devices

Source data entry devices are used for audio input, video input and to enter the source document directly to the computer. Source data entry devices do not require data to be typed-in, keyed-in or pointed to a particular location.

Speech recognition Input Device:

The “Microphones – Speech Recognition” is a speech Input device. To operate it we require using a microphone to talk to the computer. Also, we need to add a sound card to the computer. A sound card translates analog audio signals from microphone into digital codes that the computer can store and process. Sound card also translates back the digital sound into analog signals that can be sent to the speakers. A speech recognition program can process the input and convert it into machine-recognized commands or input.

Digital Camera:

A digital camera can store many more pictures than an ordinary camera. Pictures taken using a digital camera are stored inside its memory and can be transferred to a computer by connecting the camera to it. A digital camera takes pictures by converting the light passing through the lens at the front into a digital image.

Scanner:

Scanner is an input device that accepts paper document as an input. Scanner is used to input data directly into the computer from the source document without copying and typing the data. The input data to be scanned can be a picture, a text or a mark on a paper. It is an optical input device and uses light as an input source to convert an image into an electronic form that can be stored on the computer. Scanner accepts the source paper document, scans the document and translates it into a bitmap image to be stored on the computer. The denser the bitmap, the higher is the resolution of the image. The quality of scan increases with the increase in resolution. Scanners come with utility software that allow the stored scanned documents to be edited, manipulated and printed.

Optical Character Recognition (OCR)

OCR is the ability of machine to recognize characters. OCR is a type of optical scanner, which can detect alphanumeric characters printed on paper. The OCR uses special light, or optic to read text from a piece of paper. A special font standard is needed to recognize character. The OCR system consist of combination of hardware and software to recognize characters. The advanced OCR system can read variety of fonts, but still have difficulty to read hand written text. OCR systems can recognise many different OCR fonts, as well as typewriter and computer-generated characters.

The OCR devices examine each character by analyzing point of characters then when the whole character is scanned, it is compared with standard fonts in which OCR devices are programmed to recognize the optical characters. OCR is used for large volume processing application such as reading of passenger tickets, processing motor vehicles registration etc.

OCR technology revolutionised data capture, digitisation and automation of back office operations that involved the processing of paper or digital documents such as PDF invoices, contracts, ID etc. As a technology it provides the ability to recognise machine produced characters as part of data capture and extraction process.

ICR (Intelligent Character Recognition)

ICR is the computer translation of hand printed and written characters. A scanned image of a handwritten document is analysed and recognised by sophisticated ICR software. ICR is similar to optical character recognition (OCR) but is a more difficult process since OCR is from printed text, as opposed to handwritten characters which are more variable.

6. Template Based Intelligent Capture

Templates are used to reduce variables and risks of failed data capture by optimising the capture process to certain document templates. This is combined with OCR & ICR to identify machine produced and to a lesser degree handwritten characters that are contained in particular area(s) of a document. This capability can be useful where the number of different document types being received are relatively low (typically up to 30 different document types) but consistent. Common applications include census, inter-bank transfers, logistics forms and application forms. Readsoft Forms uses this capability.

7.IDR (Intelligent Document Recognition)

Intelligent document recognition also interprets and indexes different documents based on the document type, its meta data and elements of the document identified. For example, invoices, letters, contracts, post codes, logos, key words, VAT registration numbers. Data that has been identified through OCR can be validated and verified through look-up tables and database as well configure or “taught” rules associated with such documents and data. and even databases to maximise accuracy.

Specialised applications exist for departmental projects such as invoice processing. IDR applications can hold information about suppliers generated from other line-of-business systems and match invoices to that information, using recognised text such as VAT number, telephone number, post code etc. The application then looks for keyword identifiers on the invoice and extrapolates the value nearby. Validation rules are then applied, for example the NET amount plus the VAT amount must equal the gross amount, minimising the chance for errors.

8. Intelligent Image and Video Capture

Intelligent image and video data capture involves real-time analysis of images and moving image data for objects or “triggers” before executing a certain process. There are a wide range for applications for automated image and video data capture and analytics including: health and Safety or QA monitoring, crowd and footfall analysis, sentiment analysis, facial recognition, ANPR (Automatic Number Plate Recognition), Fire detection or elevated temperatures in humans, animals or machinery (Thermal imaging), site protection, People counting, queue monitoring, patient activity, object counting, etc

Augmented Reality

Augmented reality is closely linked to video analysis and involves the real time processing of camera footage looking for programmed “trigger” objects. If a trigger object is identified, a process is executed for example display an overlay graphic, video or other web data. AR applications are increasing in popularity as the digital and physical worlds get closer ie Google Streetview, Skyview, Pokemon.

9. Optical Mark Recognition (OMR):

OMR is used to detect marks on a paper. The marks are recognized by their darkness. OMR uses an optical mark reader to read the marks. The OMR reader scans the forms, detects the mark that is positioned correctly on the paper and is darker than the surrounding paper, and passes this information to the computer for processing by application software. For this, it uses a beam of light that is reflected on the paper with marks, to capture presence and absence of marks. The optical mark reader detects the presence of mark by measuring the reflected light. The pattern of marks is interpreted and stored in the computer.

OMR is widely used to read answers of objective type tests, where the student marks an answer by darkening a particular circle using a pencil. OMR is also used to read forms, questionnaires, order forms, etc.

10. Magnetic Ink Character Recognition (MICR):

MICR is used in banks to process large volumes of cheques. It is used for recognizing the magnetic encoding numbers printed at the bottom of a cheque. The numbers on the cheque are human readable, and are printed using an ink which contains iron particles. These numbers are magnetized. MICR uses magnetic ink character reader for character recognition. When a cheque is passed through Magnetic Ink Character Reader, the magnetic field causes the read head to recognize the characters or numbers of cheque. The readers are generally used in banks to process the cheques. The numbers in the bottom of the cheque include the bank number, branch number and cheque number. The reading speed of MICR is faster than OCR.

11. Barcode Reader:

Barcodes are adjacent vertical lines of different width that are machine readable. Goods available at supermarkets, books, etc. use barcode for identification. Barcodes are read using reflective light by barcode readers. This information is input to the computer which interprets the code using the spacing and thickness of bars. Hand-held barcode readers are generally used in departmental stores to read the labels, and in libraries to read labels on books.

Barcode readers are fast and accurate. They enable faster service to the customer and are also used to determine the items being sold, number of each item sold or to retrieve the price of item

Hybrid Intelligent Automated Data Capture /Verification Services

Hybrid automation platform combines artificial intelligence with Human intelligence to offer the highest level of automated data capture of unstructured documents as service.

Digital Forms – Web or App

When collecting information from users, which doesn’t exist already, it often makes sense to capture the data through a digital form either on the web, via an intranet page or smartphone app. Digital forms can be designed to structure the answers and data collected by avoiding too many open answers. They can also dynamically adapt to responses and prepopulate where information exists already.

Digital Signatures

A valid digital signature associated with an email or document allows a user’s identity or the authenticity of digital messages or documents to be captured. Digital signatures are often used for digital approval workflows involving parties from different companies or entities.

Web Scraping or Monitoring

Since there is now a huge amount of data on the web, web scraping tools, called web bots or crawlers (ie. Google spiders) are used to crawl through web pages and code to collect, analyse and index specific data. Web scraping is used to capture and monitor many types of web data such as news, updates, prices, contacts, policies, share data, currencies, connected devices, comments and reviews – basically anything accessible via the web.

Screen Scraping

Screen scraping is used by Robot Process Automation and other tools to navigate, interact and capture raw data that appears on a display, digital display, application or website. Once the data is captured, it is then analysed to extract elements such as text and images etc and then a workflow executed to process the data is defined by the configured workflow rules.

Legacy System Integration or Data Import and Migration

If data can’t be accessed in a legacy system due to missing features or proprietary APIs, products such as Alchemy Datagrabber Module, Formate and OnBase allow organisations with legacy systems (mainframe systems) to ingest data for improved search and archival applications.

Examples include cheque requisition reports, property tax reports, invoice and credit note runs. The reports would be analysed by the application and broken down into individual records or pages. At the same time, index information is extracted from each record or page and associated with that record or page.

The full text content of the document is also made available for searching. To improve the presentation of the document to the end user, an overlay can be added. The Overlay can be a representation of the form or paper that the original report would have been printed on. Therefore, in the case of an invoice, the record resembles the original printed invoice. Datagrabber can also be used to import images, or files, along with indexing information extracted from a legacy system or from a manually created file. It can also be used to create the required structure of a database within Alchemy.

Swipe or Proximity Cards

Magnetic swipe or proximity cards are used to store data. Card readers capture this data to confirm identity and control to access to a building or shared device for example. Bank cards are also Magnetic cards but include additional security features such as the Chip and Pin.

Direct Data Entry Devices

Direct entry creates machine-readable data that can go directly to the CPU. It reduces human error that may occur during keyboard entry. Direct entry devices include pointing, scanning and voice-input devices.

POINTING DEVICES

Pen Input Devices e.g. Light pen

Pen input devices are used to select or input items by touching the screen with the pen. Light pens accomplish this by using a white cell at the tip of the pen. When the light pen is placed against the monitor, it closes a photoelectric circuit which identifies the spot for entering or modifying data. Engineers who design microprocessor chips or airplane parts use light pens.

Touch Sensitive Screen Inputs

Touch sensitive screens, or touch screens allow the user to execute programmes or select menu items by touching a portion of a special screen. Behind the plastic layer of the touch screen are crisscrossed invisible beams of infrared light. Touching the screen with a finger can activate actions or commands. Touch screens are often used in ATMs, information centres, restaurants and stores and Smart phones. They are popularly used at petrol stations for customers to select the grade of fuel or request a receipt at the pump (in developed countries), as well as in fast-food restaurants to allow clerks to easily enter orders.

SCANNING DEVICES

Scanning devices, or scanners, can be used to input images and character data directly into a computer. The scanner digitises the data into machine-readable form.

The scanning devices used in direct-entry include:
Image Scanner– converts images on a page to electronic signals.
Fax Machine – converts light and dark areas of an image into format that can be sent over telephone lines.
Bar-Code Readers / QR Recognition photoelectric scanner that reads vertical striped marks printed on items. The application of single or multiple bar codes to particular document types such as Proof of Delivery notes, membership forms, application forms, gift aid etc., can dramatically increase the effectiveness of a business process.

Depending on the type of barcode that is used, the amount of metadata that can be included or marked up can be high, as is the level of recognition. QR codes for example can contain webpage links ultimately linking a webpage of almost anything and any amount of information. Barcodes can be applied to documents, webpages or almost any objects for a range of purposes including inventory management, location or task tracking, webpage opening or authentication via authenticated app, production batch tracking, delivery notes, digital form locating and more. Smartphone with barcode applications have removed the need for dedicated barcode scanning tools making barcode use even more affordable.

Character and Mark Recognition Devices – scanning devices used to read marks on documents. Character and Mark Recognition Device Features Can be used by mainframe computers or powerful microcomputers.

There are several character and mark recognition devices:

Magnetic-Ink Character Recognition (MICR)

MICR readers are used to read the numbers printed at the bottom of checks in special magnetic ink. These numbers are an example of data that is both machine readable and human readable. This data capture technology is capable of recognising characters that are machine printed in a magnetic ink. It is mainly used in the bank industry for cheque processing.

Optical-Character Recognition (OCR)

Read special preprinted characters, such as those on utility and telephone bills. OCR systems can recognize many different OCR fonts, as well as typewriter and computer-printed characters. It can also be used to capture low to high volumes of data, where the information is in consistent location(s) on the documents

Optical-Mark Recognition (OMR)

This approach is used to capture human marked data on scanned forms, surveys and exams. Reads marks on tests – also called mark sensing. Optical mark recognition readers are often used for test scoring since they can read the location of marks on what is sometimes called a mark sense document. This is how, for instance, standardised tests such as the KCPE, SAT or GMAT are scored.

Intelligent Character Recognition (ICR)

ICR is the computer translation of hand printed and written characters. Data is entered from hand-printed forms through a scanner, and the image of the captured data is then analyzed and translated by sophisticated ICR software. ICR is similar to optical character recognition (OCR) but is a more difficult process since OCR is from printed text, as opposed to handwritten characters

VOICE–INPUT DEVICES

Voice-Input Devices can also be used for direct input into a computer. Speech recognition can be used for data input when it is necessary to keep the hands free. For example, a doctor may use voice recognition software to dictate medical notes while examining a patient. Voice recognition can also be used for security purposes to allow only authorised people into certain areas or to use certain devices.

Voice-input devices convert speech into a digital code.
The most widely used voice-input device is the microphone.
A microphone, sound card and software form a voice recognition system.

Note:

Point-of-sale (POS) terminals (electronic cash registers) use both keyboard and direct entry.

Keyboard Entry can be used to type in information.
Direct Entry can be used to read special characters on price tags.
Point-of-sale terminals can use wand readers or platform scanners as direct entry devices.
Wand readers or scanners reflect light on the characters.
Reflection is changed by photoelectric cells to machine-readable code.
Encoded information on the product’s barcode e.g. price appear on terminal’s digital display.

Data Processing

Data processing refers to the rules by which data is converted into useful information. A data processing system is an application that is optimized for a certain type of data processing. The simplest example of data processing is data visualization. Data undergoes a series of conversion operations to produce data analysis reports, for example in the form of graphs.

Data processing are those activities, which are concerned with the systematic recording, arranging, filing, processing and dissemination of facts relating to the physical events occurring in the business. Data processing can also be described as the activity of manipulating the raw facts to generate a set or an assembly of meaningful data (information). Data processing activities include data collection, classification, sorting, adding, merging, summarising, storing, retrieval and dissemination.

The black box model is an extremely simple principle of a machine, that is, irrespective of how a machine operates internally, it takes an input, operates on it and then produces an output.

Processing

This data consists of: numerical data, character data and special (control) characters. Use of computers for data processing involves four stages:

Data Input – This is the process of capturing data into the computer system for processing. Input devices are used.
Storage – This is an intermediary stage where input data is stored within the computer system or on secondary storage awaiting processing or output after processing. Programme instructions to operate on the data are also stored in the computer.
Processing – The central processing unit of the computer manipulates data using arithmetic and logical operations.
Data Output – The results of the processing function are output by the computer using a variety of output devices.

Data processing activities include:

Recording – bringing facts into a processing system in usable form.
Classifying – data with similar characteristics are placed in the same category, or group.
Sorting – arrangement of data items in a desired sequence.
Calculating – applying arithmetic functions to data.
Summarizing – condensing data or ting put it in a briefer form.
Comparing – performing an evaluation in relation to some known measures.
Communicating – the process of sharing information.
Storing – to hold processed data for continuing or later use.
Retrieving – to recover data previously stored.

Information processing

This is the process of turning data into information by making it useful to some person or process.

Computer files – A file is a collection of related data or information that is normally maintained on a secondary storage device. The purpose of a file is to keep data in a convenient location where they can be located and retrieved as needed. Two general types of files are used in computerised information systems: master files and transaction files.

Master Files

Contain information to be retained over a relatively long period of long time. The information is updated continuously to represent the current status of the business. Eg in an accounts receivable file which is maintained by companies that sell to customers on credit, each account record contains such information as account number, customer name and address, credit limit amount, the current balance owed, and fields indicating the dates and amounts of purchases during the current reporting period. When a new purchase is made the account is updated and a new account balance is computed and compared with the credit limit. If the new balance exceeds the credit limit, an exception report may be issued and the order may be held up pending management approval.

Transaction Files

Transaction files contain records reflecting current business activities. Records in transaction files are used to update master files. From the illustration above, records containing data on customer orders are entered into transaction files. These transaction files are then processed to update the master files. This is known as posting transaction data to master file.

Preparing Data for Data Processing

Before data can be processed and analyzed, it needs to be prepared, so it can be read by algorithms. Raw data needs to undergo ETL – extract, transform, load – to get to the data warehouse for processing. Xplenty simplifies the task of preparing data for analysis. With a cloud platform, ETL data pipelines can be built within minutes. The simple graphical interface does away with the need to write complex code.

Methods of Data Processing

There are several different types of data processing, which differ in terms of availability, atomicity, and concurrency, among other factors. The method of data processing employed will determine the response time to a query and how reliable the output is.

The following are the most common types of data processing and their applications.

Transaction Processing

Transaction processing is deployed in mission-critical situations. These are situations, which, if disrupted, will adversely affect business operations. In transaction processing, availability is the most important factor. Availability can be influenced by factors such as:

Hardware: A transaction processing system should have redundant hardware. Hardware redundancy allows for partial failures, since redundant components can be automated to take over and keep the system running.
Software: The software of a transaction processing system should be designed to recover quickly from a failure. Typically, transaction processing systems use transaction abstraction to achieve this- in case of a failure, uncommitted transactions are aborted. This allows the system to reboot quickly.

Distributed Processing

Very often, datasets are too big to fit on one machine. Distributed data processing breaks down these large datasets and stores them across multiple machines or servers. It rests on Hadoop Distributed File System (HDFS). A distributed data processing system has a high fault tolerance. If one server in the network fails, the data processing tasks can be reallocated to other available servers. Distributed processing can be cost-saving so there will be need to build expensive mainframe computers anymore orinvest in their upkeep and maintenance. Stream processing and batch processing are common examples of distributed processing.

Bottom of Form

3. Real-time Processing

Real-Time Processing involves continuous input, process, and output of data. Hence, it processes in a short period of time. There are some programs which use such data processing type. For example, bank ATMs, customer services, radar systems, and Point of Sale (POS) Systems. Every transaction is directly reflected in the master file, with this data process. So, that it will always be up-to-date.
Spark Real-Time processing is key in producing analytics results in real time,. Data is fed into analytics tools, by building data streams, as soon as it is generated. It gets near-instant analytics results by using platforms like Spark Streaming.
For tasks like fraud detection, real-time processing is very useful. It can detect that signal fraud in real time and stop fraudulent transactions before they take place, through real-time processing.

Real-time processing is similar to transaction processing, in that it is used in situations where output is expected in real-time. However, the two differ in terms of how they handle data loss. Real-time processing computes incoming data as quickly as possible. If it encounters an error in incoming data, it ignores the error and moves to the next chunk of data coming in. GPS-tracking applications are the most common example of real-time data processing. With transaction processing, in case of an error, such as a system failure, ongoing processing is aborted and the system reinitializes. Real-time processing is preferred over transaction processing.

In the world of data analytics, stream processing is a common application of real-time data processing. Stream processing analyzes data as it comes in e.g. tracking consumer activity in real-time. Google Big Query and Snowflake are examples of cloud data platforms that employ real-time processing.

Real-Time processing system:

helps to compute a function of one data element and also a smallish window of recent data.
computes something relatively simple
computes in near-real-time, in seconds at most.
computations are generally independent.
is asynchronous in nature – a source of data doesn’t interact with the stream processing directly.

Advantages of Real-Time Processing

there is no significant delay in response.
information is always up to date enabling the organization to take immediate action and also responds to an event, issue or scenario in the shortest possible time.
It also makes the organization able to gain insights from the updated data. Even helps to detect patterns of possible identification of either opportunities or threats.

Disadvantages of Real-Time Processing

It is very complex as well as expensive.
It is very difficult for auditing.
It is a bit tedious.

Batch Processing

Batch processing is when chunks of data, stored over a period of time, are analyzed together, or in batches. Batch processing is required when a large volume of data needs to be analyzed for detailed insights. For example, sales figures of a company over a period of time will typically undergo batch processing. Since there is a large volume of data involved, the system will take time to process it. By processing the data in batches, it saves on computational resources.

Batch processing is preferred over real-time processing when accuracy is more important than speed. Additionally, the efficiency of batch processing is also measured in terms of throughput. Throughput is the amount of data processed per unit time.

It is an efficient way of processing high/large volumes of data especially where a group of transactions is collected over a period of time. Data is collected, entered and processed. Afterward, it produces batch results. Hadoop works on batch data processing. For input, process, and output, batch processing requires separate programs. Payroll and billing systems are examples of batch processing.

Example

A sales team gathers information throughout a specified period of time. Afterwards, all that information is entered into the system all at once. This whole procedure is known as Batch Processing. Generally, it works for printing shipping labels, packing slips and payment processing – waiting to do everything all at once.

Batch processing system

accesses all data.
computes big and complex data.
Is concerned with throughput rather than the latency of individual components of the computation.
has latency measured in minutes or more.

Advantages of Batch Processing

It is ideal for processing large volumes of data/transaction.
It also increases efficiency rather than processing each individually.
Processing can be done independently. Even during less-busy times or at a desired designated time.
It offers cost efficiency.
It allows a good audit trail.

Disadvantages of Batch Processing

causes delay between the collection of data and getting the result after the batch process.
the batch processing master file is not always up to date.
Here, a one-time process can be very slow.

Multiprocessing

Multiprocessing is the method of data processing where two or more than two processors work on the same dataset. It is different from distributed processing in that different processors reside within the same system. Thus, they are present in the same geographical location. If there is a component failure, it can reduce the speed of the system.

In distributed processing, servers are independent of each other and can be present in different geographical locations. Since almost all systems today come with the ability to process data in parallel, almost every data processing system uses multiprocessing. However, multiprocessing can be seen as having an on-premise data processing system. Typically, companies that handle very sensitive information might choose on-premise data processing as opposed to distributed processing. For example, pharmaceutical companies or businesses working in the oil and gas extraction industry. The most drawback here is cost as building and maintaining in-house servers is very expensive.

S	M	T	W	T	F	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28

Written by MJ