Implementation | 7 min read

What Data You Need in Order Before Implementing AI

Byron Carranza, CTO

June 22, 2026

TL;DR

The most common misconception before an AI project is that you need perfect data to begin. The reality is different: what you need is to understand what data you have. Messy data is fixable; missing data requires a capture strategy before the project can move forward. Knowing which problem you have determines your actual starting point.

The perfect data myth

"We need to get our data in order first." This is one of the most common things we hear when a company starts evaluating an AI project. The underlying intuition is reasonable — if the input data is bad, the system's output will be bad too.

The problem is that "getting data in order" without a specific objective can become an endless project. The data of any medium-sized business will always have inconsistencies, empty fields, mixed formats, and duplicates. Waiting for perfection means waiting indefinitely.

What actually matters, and what few companies do before starting, is understanding what data they have, what condition it is in, and whether each gap is a quality problem or an absence problem. That distinction determines whether a project can start now or needs preparatory work before the AI layer makes sense.

Two very different problems: messy data versus missing data

Messy data

Messy data exists — it is there, in some system or file — but it has problems of format, consistency, or partial completeness. A phone number field where some records include the country code and others do not. Client names entered in different formats by different people over the years. Dates stored in three different formats in the same column.

This type of problem is manageable. It is not trivial, but it has a technical solution: data cleaning, normalization, validation rules. In most AI projects for medium-sized businesses, messy data represents additional work measured in days or weeks, not a fundamental blocker.
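As a rough illustration of what that cleaning work can look like, here is a minimal Python sketch that normalizes a date column stored in several formats and a phone field with inconsistent country codes. The format list, the function names, and the default country code (506, with 8-digit local numbers) are assumptions made for the example, not a description of any particular system.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed formats for illustration; list the ones actually found in the column.
# Order matters: an ambiguous value like 05/03/2024 is read as day/month here.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")


def normalize_date(raw: str) -> Optional[date]:
    """Try each known format in order; return None when none matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None  # leave unparseable values for manual review


def normalize_phone(raw: str, country_code: str = "506") -> Optional[str]:
    """Return the number as +<country code><local digits>, or None if unclear.

    Assumes 8-digit local numbers and a 3-digit country code (hypothetical defaults).
    """
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, plus signs
    if len(digits) == 8:
        return f"+{country_code}{digits}"
    if len(digits) == 8 + len(country_code) and digits.startswith(country_code):
        return f"+{digits}"
    return None


print(normalize_date("05/03/2024"))   # 2024-03-05
print(normalize_phone("2222-3333"))   # +50622223333
```

Values that return None are not guessed at. They are the records someone who knows the data should look at before a rule is written for them.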

Missing data

Missing data is a different category entirely. This is not about data that is poorly formatted — it is about data that was never captured because nobody defined it as important, because the process had no capture point, or because the tool being used did not support it.

A concrete example: a company wants to build an agent that predicts which clients are most likely to renew their contract. To do that, it needs interaction history by client: calls, emails, reported incidents, satisfaction scores. If that history was never logged in a structured way, the problem is not data quality. The data is absent, and cleaning what exists will not fix that.

The distinction matters because the solutions are completely different. Messy data gets cleaned. Missing data requires redesigning capture processes and waiting for sufficient volume to accumulate before it becomes useful.

Minimum viable data hygiene for common AI projects

Depending on the type of project, specific datasets need to be in reasonable shape before work can begin productively.

Customer service and support agents

What is needed: a history of requests or tickets with date, category, channel, and resolution. They do not need to be perfectly classified, but the basic information — what was requested and how it was resolved — needs to exist in structured form.

What blocks the project: no structured history at all, or data that lives in email threads that were never converted into records.

Proposal and commercial document automation

What is needed: existing templates or well-formed document examples, client information with consistent fields (name, industry, size, services contracted), and a service or product catalog with descriptions.

What blocks the project: client information fragmented across multiple systems without a common identifier, or a service catalog that exists only in the heads of one or two people.

Operations analysis and reporting

What is needed: transactional data with dates, amounts, and consistent categories. Not perfect — but with reproducible logic. If the same type of transaction appears under three different names in the database, the system needs an equivalence table before it can reason about patterns.

What blocks the project: data that exists only in manually generated PDF reports, or historical records that have never been digitized.
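The equivalence table mentioned above can be sketched in a few lines of Python. The labels, the canonical category name, and the function names below are hypothetical; the point is only that every raw label maps to exactly one category, and that unmapped labels are reported instead of silently dropped.

```python
from collections import defaultdict

# Hypothetical equivalence table: raw label (lowercased) -> canonical category.
EQUIVALENCES = {
    "consulting": "consulting",
    "consultancy": "consulting",
    "advisory svc": "consulting",
}


def canonical_category(raw_label):
    """Lowercase and collapse whitespace, then look the label up in the table."""
    key = " ".join(raw_label.lower().split())
    return EQUIVALENCES.get(key)


def totals_by_category(transactions):
    """Sum amounts per canonical category; collect labels with no mapping."""
    totals = defaultdict(float)
    unmapped = set()
    for label, amount in transactions:
        category = canonical_category(label)
        if category is None:
            unmapped.add(label)
        else:
            totals[category] += amount
    return dict(totals), sorted(unmapped)


sample = [("Consulting", 100.0), ("consultancy ", 50.0),
          ("Advisory  Svc", 25.0), ("Misc", 10.0)]
print(totals_by_category(sample))
# ({'consulting': 175.0}, ['Misc'])
```

The list of unmapped labels is the useful by-product: it shows how much of the table still has to be filled in by someone who knows what the labels mean.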

Inventory and logistics management

What is needed: a product or SKU catalog with unique identifiers, movement records with dates, and supplier or delivery point data.

What blocks the project: an inventory managed primarily in local spreadsheets that different people update independently without synchronization — so there is no single authoritative version of the data.

The three questions to ask about any dataset

Before deciding whether a given area's data is ready for an AI project, three questions orient the diagnosis.

Does the data exist, even if it is messy?

If the answer is yes, the problem is quality. It is technical work, it has a solution, and the effort can be estimated. If the answer is no, the problem is capture. The process for collecting that data needs to be designed before the AI layer can use it.
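One way to make this question measurable is to compute, per field, the share of records that actually carry a value. The sketch below assumes records have been exported as a list of dictionaries; the field names and the `fill_rates` helper are illustrative. It deliberately counts a zero as a present value, since a zero and a blank usually mean different things.

```python
def is_blank(value):
    """Treat None and empty or whitespace-only strings as blank; 0 is a value."""
    return value is None or (isinstance(value, str) and not value.strip())


def fill_rates(records, fields):
    """Return the fraction of records with a non-blank value for each field."""
    total = len(records)
    rates = {}
    for field in fields:
        filled = sum(1 for record in records if not is_blank(record.get(field)))
        rates[field] = filled / total if total else 0.0
    return rates


clients = [
    {"industry": "retail", "size": 0},
    {"industry": "", "size": None},
    {"industry": "  ", "size": 12},
    {"size": 5},
]
print(fill_rates(clients, ["industry", "size"]))
# {'industry': 0.25, 'size': 0.75}
```

A low fill rate on a field that exists points to a quality problem that can be estimated. A field that is absent from every record points to a capture problem.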

Is there a consistent unique identifier?

The unique identifier is what allows data from different sources to be connected. The client number, the product code, the order number. If that identifier exists and is consistent across systems, many quality problems become solvable. If it does not exist — if the same client has different identifiers in the CRM, the accounting system, and the tracking spreadsheet — there is a data architecture problem that needs to be resolved before automation can work reliably.
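A quick check of identifier consistency, sketched in Python under the assumption that each system can export its list of client identifiers. The system names, the sample identifiers, and the normalization rule (trim whitespace, uppercase) are hypothetical choices for the example.

```python
def identifier_overlap(systems):
    """Compare client identifiers across systems.

    systems: dict mapping a system name to an iterable of identifiers.
    Returns (ids present in every system, ids per system missing elsewhere).
    """
    # Assumed normalization: trim whitespace and ignore case differences.
    normalized = {
        name: {str(identifier).strip().upper() for identifier in ids}
        for name, ids in systems.items()
    }
    shared = set.intersection(*normalized.values()) if normalized else set()
    orphans = {name: sorted(ids - shared) for name, ids in normalized.items()}
    return sorted(shared), orphans


shared, orphans = identifier_overlap({
    "crm": ["C-001", "c-002", "C-003"],
    "accounting": ["C-001 ", "C-002", "A-17"],
})
print(shared)   # ['C-001', 'C-002']
print(orphans)  # {'crm': ['C-003'], 'accounting': ['A-17']}
```

A short orphan list suggests cleanup work. A long one, or identifiers that follow different schemes in each system, suggests the architecture problem described above.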

Who can explain the logic of the data?

Every dataset has implicit rules that were never documented: why some records have a certain field and others do not, what a blank value means versus a zero value, what changed in the process two years ago that explains why older records have a different format. The person who carries that knowledge — usually someone on the team who has been working with that data for years — is a critical resource at the start of any project. If that knowledge lives in a single person and that person is not available during the project, the technical team will make decisions about the data without the context that makes those decisions correct.

A practical example: B2B services company in Costa Rica

A professional services company with 35 employees wants to build an agent that automates commercial proposal generation. They have four years of sent proposals, a CRM with client information, and a service catalog.

The data review reveals the following. Previous proposals exist as Word files without consistent structure — the first two years use one format, the last two use another. Client records in the CRM have industry and size fields filled in for about 60% of entries. The service catalog exists as a sales presentation, not as a structured document.

The diagnosis: the proposals are messy data. There is standardization work to do, but the data exists. The missing CRM fields are partially recoverable: the sales team can complete the most important records in about a week. The service catalog is a missing-data problem: the content exists only as a presentation, so a structured version needs to be created before the agent can use it.

The result: the project does not get blocked, but it starts with two weeks of parallel data preparation alongside the technical build. The agent reaches production with a clean, complete data foundation, and that determines the quality of every output from day one.

Is your company considering an AI project but unsure about the state of its data? Schedule a session and we will review what you have available, identify what needs cleaning versus what is missing entirely, and define the real starting point of the project, without assumptions.