Problem Understanding (aka Define Business Problem)
Charles Kettering once said, “A problem well stated is a problem half solved”. If you can define the problem clearly, solving it becomes much easier. Business problems are current or long-term challenges and issues faced by a business. These may prevent a business from executing its strategy and achieving its goals.
It is necessary to define a clear data science objective. This requires an understanding of how value and information flow in a business, and the ability to use that understanding to identify business opportunities.
Decompose To Machine Learning Tasks
Effectively translating business requirements into a data-driven solution is key to the success of your data science project. So far, we have only defined the problem in business terms. Before any machine learning happens, we need to move from monetary units to other KPIs that the machine learning team can understand and measure. Here, we also need to identify the machine learning problem category, such as classification, regression, or clustering.
The data preparation process is one of the main challenges that plague most projects. According to a recent study, data preparation tasks take more than 80% of the time spent on ML projects. Data scientists spend most of their time on data cleaning (25%), labeling (25%), augmentation (15%), aggregation (15%), and identification (5%).

1. Data Collection
Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. Data collection is the procedure of collecting, measuring, and analyzing accurate insights for research using standard validated techniques. In most cases, data collection is the primary and most important step for research, irrespective of the field. The approach to data collection differs between fields of study, depending on the required information. Accurate data collection is essential to maintaining the integrity of research.
- Data Augmentation: Data augmentation is a set of techniques used to increase the amount of data by adding slightly modified copies of existing data, or synthetic data newly created from existing data. It acts as a regularizer and helps reduce overfitting when training a machine learning model.
- Data Labeling: Data labeling is the process of identifying unstructured data and tagging it so that it becomes a structured dataset for machine learning algorithms. It plays a very important role in machine learning and AI-based projects: without labeled data, a supervised machine learning model cannot be trained. Data labeling is therefore a necessary step for the continued growth of the AI and ML industry.
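The augmentation idea above can be sketched for simple numeric data. This is a minimal illustration, not a standard recipe: the function name augment_with_jitter, the noise scale, and the sample rows are all assumptions made for the example. It adds slightly perturbed copies of each labeled sample and leaves the labels unchanged.

```python
import random


def augment_with_jitter(rows, copies=2, noise_scale=0.05, seed=0):
    """Return the original rows plus `copies` noisy variants of each row.

    Each feature is perturbed by Gaussian noise whose standard deviation is
    `noise_scale` times the absolute feature value. Labels are not changed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    augmented = list(rows)     # keep the original samples first
    for features, label in rows:
        for _ in range(copies):
            noisy = [x + rng.gauss(0.0, noise_scale * abs(x)) for x in features]
            augmented.append((noisy, label))
    return augmented


# Hypothetical labeled samples in the form (features, label).
labeled_rows = [([5.1, 3.5], "class_a"), ([6.2, 2.9], "class_b")]
augmented_rows = augment_with_jitter(labeled_rows, copies=2)
print(len(labeled_rows), "->", len(augmented_rows))
```

For images or text, the "slightly modified copies" would come from different operations (flips, crops, synonym replacement), but the structure is the same: the label is kept and only the input is varied.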
2. Data Pre-processing
Data in the real world is dirty and needs to be freed from discrepancies. Some models work only if the data provided is in a specific format and free from errors. After you have selected the data, you need to consider how you are going to use it.
- Formatting: The data you have selected may not be in a format that is suitable for you to work with. The data may be in a relational database and you would like it in a flat file, or the data may be in a proprietary file format and you would like it in a relational database or a text file.

- Data Discretization: Converting continuous values into a small number of intervals (bins). It is part of data reduction but of particular importance, especially for numerical data.
- Data Cleaning: Cleaning data is the removal or fixing of missing data. There may be data instances that are incomplete and do not carry the data you believe you need to address the problem.
- Data integration: Integration of multiple databases, data cubes, or files.
- Data Transformation: Normalization and aggregation.
- Sampling: There may be far more selected data available than you need to work with.
- Data reduction: Obtains a representation that is reduced in volume but produces the same or similar analytical results.
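Three of the steps listed above (cleaning, transformation, and discretization) can be sketched on a single numeric column. This is a minimal sketch under stated assumptions: the helper names, mean imputation as the cleaning strategy, min-max scaling as the normalization, equal-width bins, and the sample column are all choices made for illustration. Other strategies may suit a given dataset better.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, n_bins=3):
    """Data discretization: map each value to an equal-width bin index."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, None, 40, 60, 30]  # hypothetical column with one missing value
cleaned = fill_missing(ages)
scaled = min_max_normalize(cleaned)
bins = discretize(cleaned, n_bins=3)
print(cleaned, scaled, bins)
```

The order matters here: the missing value is filled first, because min, max, and the bin boundaries cannot be computed on a column that still contains None.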
3. Exploratory Data Analysis
The main purpose of EDA is to help look at data before making any assumptions. Without a proper EDA, machine learning work suffers from accuracy issues and, many times, the algorithms won’t work. Data scientists can use exploratory analysis to ensure that the results they produce are valid and applicable to the desired business outcomes and goals. EDA also helps stakeholders by confirming that they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals.
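A first pass at the questions mentioned above might look like the following sketch. The column names and values are hypothetical, and the confidence interval uses the normal approximation (mean plus or minus 1.96 standard errors), which is only a rough guide for small samples.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def summarize_numeric(values, z=1.96):
    """Return the mean, sample standard deviation, and an approximate 95% CI."""
    m = mean(values)
    s = stdev(values)
    half_width = z * s / sqrt(len(values))  # z times the standard error
    return {"mean": m, "stdev": s, "ci": (m - half_width, m + half_width)}


def summarize_categorical(values):
    """Return (category, count) pairs, most frequent category first."""
    return Counter(values).most_common()


# Hypothetical columns from a small order dataset.
order_values = [12.0, 15.5, 9.0, 22.0, 14.5, 13.0]
segments = ["retail", "retail", "wholesale", "retail", "online", "online"]

numeric_summary = summarize_numeric(order_values)
segment_counts = summarize_categorical(segments)
print(numeric_summary)
print(segment_counts)
```

Even a summary this small can surface issues worth raising with stakeholders, such as a category that has too few samples to model reliably.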
4. Model training
Model training in machine learning is the process of feeding an ML algorithm with data to help it identify and learn good values for all the parameters involved. There are several types of machine learning, of which the most common are supervised and unsupervised learning.
Supervised learning is possible when the training data contains both the input and output values. Each pair of inputs and expected output is a training example, and the expected output is called the supervisory signal. Training is driven by the deviation of the model's result from the documented result when the inputs are fed into the model.
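To make the idea of training on the deviation concrete, here is a minimal sketch that fits a straight line by gradient descent. The model form (y = w * x + b), the mean squared error loss, the learning rate, and the toy data are assumptions chosen for illustration, and real projects would normally rely on an established library.

```python
def train_linear_model(inputs, targets, learning_rate=0.05, epochs=1000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(inputs)
    for _ in range(epochs):
        # Deviation of the model's output from the expected output.
        errors = [(w * x + b) - y for x, y in zip(inputs, targets)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, inputs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient to reduce the deviation.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training set generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train_linear_model(xs, ys)
print(round(w, 3), round(b, 3))
```

Each pass compares the model's predictions with the expected outputs and nudges the parameters to shrink that gap, which is the iterative loop described above.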
Unsupervised learning involves determining patterns in the data. There is no reference output dataset in this method. Instead, the algorithm groups the data into patterns or clusters, and additional data can then be fitted to them. This is also an iterative process: each pass refines the patterns or clusters so that they fit the data more closely.
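Clustering is one common form of unsupervised learning. The sketch below uses k-means on one-dimensional points as an example technique. The function name, the starting centroids, and the data are illustrative assumptions, and k-means is only one of many clustering algorithms.

```python
def k_means_1d(points, centroids, max_iterations=100):
    """Cluster 1-D points with k-means, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(c) / len(c) if c else centroids[i]  # an empty cluster keeps its centroid
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so the clustering is stable
            break
        centroids = updated
    return centroids, clusters


# Hypothetical unlabeled measurements that form two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
final_centroids, final_clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(final_centroids, final_clusters)
```

No expected output is supplied anywhere: the groups emerge only from the distances between the points, and the loop stops once another pass no longer changes the centroids.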
5. Model Evaluation
Once the models are trained, they must be evaluated so that one model can be selected from the many that were trained. The following techniques can be used for evaluating models:
- Basic parameters
- Resampling methods
- Cross-validation
- Statistical tests
- Evaluation metrics
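As one way to combine cross-validation with an evaluation metric, the sketch below compares two candidate models using k-fold cross-validation and mean absolute error. The two models (a mean baseline and a one-feature least squares line), the helper names, and the data are assumptions made for illustration.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b with a single feature."""
    mx, my = mean(xs), mean(ys)
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return lambda x: w * x + b


def cross_validated_mae(fit, xs, ys, k=3):
    """Average the mean absolute error over k held-out folds."""
    fold_errors = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(mean([abs(model(xs[i]) - ys[i]) for i in test]))
    return mean(fold_errors)


# Hypothetical dataset with a roughly linear relationship.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
baseline_mae = cross_validated_mae(fit_mean, xs, ys)
line_mae = cross_validated_mae(fit_line, xs, ys)
print(baseline_mae, line_mae)
```

Because every model is scored only on data it did not see during fitting, the model with the lower cross-validated error is the more defensible choice. With real data, the samples should usually be shuffled before splitting into folds.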
6. Conclusion
Well-prepared data is crucial for the success of machine learning models. However, data preparation is a time-intensive and sensitive process that is full of challenges. Therefore, self-service data preparation tools have been designed to enhance the productivity of data scientists and accelerate the performance of ML models.
Kedar Bhingarde
Jr. Data Scientist