Why build this instead of using Keepa, Helium 10 or similar product-research tools?

Because those tools answer a different question. They start from the marketplace and tell you whether a given ASIN looks attractive. This problem is inverted: it starts from roughly 300,000 wholesale part numbers and needs the join between supplier catalogs and the marketplace — code normalization across formats, multi-supplier cost comparison, availability and lead time. No off-the-shelf tool holds your cost side. In fact, the pipeline uses such tools as one of its data sources; what was built is the join, the margin model and the decision trail on top of them — not a replacement for the data layer.

How do you know the LLM classifier isn't quietly wrong?

Three containment mechanisms, described in Stage 4: a mandatory uncertain class (10.3% of listings) routed to human review and never acted on automatically; an asymmetric, capped misclassification cost, because a wrongly included listing still has to survive the margin and competition gates downstream — a false positive costs one manual review, not a purchase; and spot-checks on samples of verdicts during tuning. There is no large hand-labeled benchmark behind it yet, and the article says so deliberately. The funnel is the safety net: the classifier filters, it never decides alone.

What did the AI part actually cost to run?

LLM classification was a marginal fraction of total cost: roughly ten thousand short text classifications, each run once and then served from cache unless the underlying listing changes materially. The dominant line item is market-data API credits, which is precisely why aggressive caching is the central economic decision of the pipeline, not an optimization. Absolute figures age badly and depend on volumes; the structural point does not: in a pipeline like this, the model is cheap and the data is expensive, so the architecture should protect the data.

Does the pipeline buy products automatically?

No, and it should not. Every product that survives the funnel arrives at a human review queue with its full decision trail: matched identifiers, margin at the average and at the historical floor, competition picture, demand indicators with known bias, classification label with confidence. The system's job is to make that review fast and well-informed — not to remove it.

Finding Needles in a 300,000-SKU Haystack: An AI-Assisted Product Sourcing Pipeline

Most articles about AI in e-commerce still focus on chatbots, product descriptions, customer support, or content generation. Those are useful areas, but they are not where the most interesting value is created. In many operational businesses, the harder question is not how to answer a customer faster. It is how to make better decisions before the product is even sold.

One example is product sourcing.

For a spare-parts retailer, the theoretical catalog can be enormous. In the project this article describes, the seed was roughly 300,000 manufacturer part numbers accessible through supplier catalogs. Only a small fraction of those products are actually worth buying, publishing, or stocking. Some have no marketplace demand. Some sell but with unacceptable margins. Some are already dominated by Amazon. Some are too crowded with sellers. Some appear promising only because of temporary price distortions. Some look like matches but are actually different products.

No human can evaluate that space manually, and no single query can solve it either. This is not a chatbot problem. It is a funnel problem.

The useful question is not “can AI choose the right products?” The useful question is more precise: how do we build a system that reduces roughly 300,000 possible products into a small, reviewable, economically credible shortlist?

That requires market data, classical data engineering, deterministic filtering, margin logic, caching, validation against real sales, and an LLM used only where it genuinely earns its place. The sections below walk through the pipeline stage by stage, with real numbers from the system, and then through the design rules that hold the funnel together.

What you will learn

How to structure an AI-assisted sourcing pipeline as a funnel — acquisition, normalization, margin modeling, classification, validation — instead of a single “smart” decision step.
Where an LLM actually earns its cost in an operational pipeline, and how to validate and contain it when it is wrong.
How to turn external estimates (demand, prices, competitor stock) from black boxes into calibrated signals by checking them against your own ground truth.

Product sourcing is a data problem before it is an AI problem

The starting point is simple: supplier price lists contain manufacturer part numbers, costs, sometimes EANs, product names, brands, stock information and supplier-specific references.

The marketplace, however, does not always speak the same language.

A supplier may list a part as 00631200. Another source may refer to the same part as 631200. A marketplace title may show 631-200. A seller may include the part number only inside a long compatibility sentence. A listing may mention the reference because the product is compatible, not because it is the original part.

This is where many sourcing projects fail early. The problem is not the availability of data. The problem is identity. Before any AI model becomes useful, the system needs to answer a more basic question: which supplier products correspond to which marketplace products?

That first step is already difficult enough to justify a proper pipeline.

Stage 1: mapping supplier codes to marketplace products

The seed of the pipeline is the list of roughly 300,000 manufacturer part numbers from supplier catalogs. The first job is to discover which of those references exist on Amazon and to map supplier codes to marketplace identifiers such as ASINs.

At first glance, this sounds like a search problem. In practice, it is a normalization problem.

The same manufacturer reference can appear in multiple forms depending on the supplier, brand, marketplace listing, language, punctuation, leading zeros and formatting conventions. A strict exact match looks safe, but it silently loses a large number of valid matches. A fuzzy match recovers more products, but it can easily introduce false positives.

The practical lesson is that match rates must be measured per supplier, not only globally. When we did that, the numbers exposed a bug in the join logic directly. After switching from strict matching to alphanumeric-only normalization (stripping punctuation, separators and leading zeros before comparison), one supplier went from 85 matched products to 430, a 5× recovery. Another gained more than 1,700 additional matches. Same catalogs, same marketplace, zero new data. The products had been there all along; the join logic was throwing them away.
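As a minimal sketch of that normalization and of the per-supplier measurement, assuming supplier rows arrive as simple (supplier, code) pairs: the function names and the data shape below are illustrative, not the pipeline's actual code. Note that dropping leading zeros can merge codes that are genuinely different, which is exactly the false-positive risk of looser matching.

```python
import re
from collections import defaultdict

def normalize_part_number(raw: str) -> str:
    """Keep letters and digits only, uppercase, then drop leading zeros."""
    alnum = re.sub(r"[^0-9A-Za-z]", "", raw).upper()
    stripped = alnum.lstrip("0")
    # An all-zero code stays as it was instead of collapsing to an empty key.
    return stripped if stripped else alnum

def match_rates(supplier_rows, marketplace_codes):
    """supplier_rows: iterable of (supplier_name, raw_code) pairs (assumed shape).
    Returns {supplier_name: (matched, total)} so a weak supplier stands out."""
    market = {normalize_part_number(code) for code in marketplace_codes}
    market.discard("")  # never match on an empty key
    counts = defaultdict(lambda: [0, 0])
    for supplier, raw_code in supplier_rows:
        counts[supplier][1] += 1
        if normalize_part_number(raw_code) in market:
            counts[supplier][0] += 1
    return {name: (matched, total) for name, (matched, total) in counts.items()}

if __name__ == "__main__":
    rows = [("supplier_a", "00631200"), ("supplier_a", "999"), ("supplier_b", "631-200")]
    for name, (matched, total) in match_rates(rows, ["631200"]).items():
        print(f"{name}: {matched}/{total} matched")
```

Reporting matched and total per supplier, rather than one global rate, is what makes a broken join visible: a single supplier with an unusually low ratio is a prompt to inspect the normalization before trusting the result.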

If one large supplier suddenly produces suspiciously few marketplace matches, the most likely explanation is not that the supplier has no relevant products. The most likely explanation is that the normalization logic is wrong. In this kind of pipeline, identifier normalization is not a small preprocessing detail. It is one of the core business rules.

The goal at this stage is not to match more at any cost. The goal is to improve recall while keeping enough structure to investigate suspicious matches later. The first stage of the pipeline does not try to decide whether a product is worth buying. It only tries to build a sufficiently broad and traceable candidate set. At this point, the system should not exclude aggressively. It should discover.

Stage 2: pulling market data once and reusing it many times

Once supplier references are mapped to marketplace products, the second stage is data acquisition.

For each matched marketplace product, the pipeline pulls the data needed to evaluate demand, price stability, competition and commercial feasibility: historical price data, seller offers, Buy Box information, stock-related signals, ranking or sales estimates where available, and marketplace-level identifiers.

The important architectural decision is that acquisition and decision-making must be separated. Raw API responses are stored locally and treated as a durable asset — in this project, a market-data cache of about 2.4 GB built from tens of thousands of deep API pulls. The decision logic is then computed from cache.

That distinction matters. Without it, every improvement to the scoring model becomes expensive. Every change in a margin formula requires another API run. Every test burns credits. Every bug risks losing previous work. With a proper cache, the system acquires data once and recomputes many times: margin thresholds can change, competition rules can change, exclusion gates can become stricter or looser, and a product can move from rejected to reviewable because the scoring model improved — not because the data had to be fetched again.

In this type of system, the cache is not a technical afterthought. The cache is the product. It is the foundation that allows the business logic to evolve.
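The separation between acquisition and decision logic can be sketched as a small read-through cache. This assumes one JSON file per ASIN and a caller-supplied `fetch` function standing in for the real market-data API; both are illustrative choices, not a description of the actual storage layout.

```python
import json
from pathlib import Path
from typing import Callable

def get_market_data(asin: str, cache_dir: Path, fetch: Callable[[str], dict]) -> dict:
    """Return the raw response for `asin`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{asin}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    raw = fetch(asin)  # the only place where API credits are spent
    # Write to a temporary file first so an interrupted run never leaves
    # a half-written cache entry behind.
    tmp = cache_dir / f"{asin}.json.tmp"
    tmp.write_text(json.dumps(raw), encoding="utf-8")
    tmp.replace(path)
    return raw
```

Scoring code then only ever calls `get_market_data`, so changing a margin formula or a threshold re-reads local files instead of triggering new API pulls.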

Stage 3: evaluating margins on distributions, not snapshots

A naive sourcing system compares today’s marketplace price with today’s supplier cost. If the margin looks acceptable, the product passes. That approach fails in both directions, and the failure is measurable.

Marketplace prices move. They spike, collapse, recover, disappear and reappear. The pipeline therefore evaluates margins against the price distribution — the 90-day average, the median, and a historical price floor computed as a low percentile over the listing’s full lifetime history — rather than against any single number. After marketplace fees, VAT assumptions, shipping bands, handling costs and supplier cost are included, the question is not “is this profitable today?” but “is this profitable across the range of prices this product has actually traded at?”

The aggregate result justifies the extra work: of 9,229 scored products, 1,067 — 11.6% — show a positive margin on the 90-day average price but a negative margin at their historical price floor. More than one product in ten looks profitable on a reasonable-sounding metric and is exposed underneath.
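A minimal sketch of that check follows. The cost model is deliberately simplified and every parameter in it (fee rate, VAT rate, flat shipping, the 5th percentile as the floor) is an assumption for illustration; the real model also includes shipping bands and handling costs that are not reproduced here.

```python
import math
import statistics
from dataclasses import dataclass

def percentile(values, pct):
    """Nearest-rank percentile, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

@dataclass
class CostModel:
    # Illustrative figures only, not the article's real parameters.
    supplier_cost: float
    fee_rate: float = 0.15   # marketplace fee as a share of the gross price
    vat_rate: float = 0.22   # VAT included in the gross price
    shipping: float = 5.0    # flat shipping and handling

    def margin(self, gross_price: float) -> float:
        net_of_vat = gross_price / (1 + self.vat_rate)
        fees = gross_price * self.fee_rate
        return net_of_vat - fees - self.shipping - self.supplier_cost

def margin_profile(recent_prices, lifetime_prices, cost: CostModel, floor_pct=5):
    """Margin at the recent average and at a low percentile of lifetime prices."""
    avg_price = statistics.fmean(recent_prices)
    floor_price = percentile(lifetime_prices, floor_pct)
    margin_avg = cost.margin(avg_price)
    margin_floor = cost.margin(floor_price)
    return {
        "avg_price": avg_price,
        "floor_price": floor_price,
        "margin_avg": margin_avg,
        "margin_floor": margin_floor,
        # Profitable on the average but under water at the floor.
        "exposed": margin_avg > 0 and margin_floor < 0,
    }
```

The `exposed` flag is the category the 11.6% figure describes: products that pass on the average and fail at the floor.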

A concrete case makes the mechanism visible. One part in the dataset has 925 recorded price points. Its current price is stable around €99–100, and the median of €95.91 confirms that the 90-day average of €103.48 is genuine — this is not a distorted listing. On the average price, the margin is +€5.44. At the historical floor of roughly €75, the margin is −€14.75. And that floor is not a glitch: the product traded at that level for sustained periods between 2021 and 2023. Today’s margin is real, but the floor tells you exactly what happens if competition returns to its earlier intensity. That is the difference between a sourcing decision and a bet on the current Buy Box.

The averages can lie in the other direction too. Another product in the dataset showed a 90-day average price of €87, while its lifetime median was €13.56. Decoding the raw price history revealed a three-week spike at €128, during which only one seller remained on the listing. The average was the lie and the lifetime median was the truth: anyone buying on the €87 figure would have stocked a €13 product. A snapshot, an average or a floor, taken alone, each misleads in a different way. The robust answer is to score against the distribution, meaning the median and percentiles, and to treat any single summary statistic as a hypothesis to be checked. This kind of interrogation of live marketplace data is the same discipline that makes post-commit verification of AI-written code so productive: plausible-looking numbers in third-party data hide stacked traps, and only structured investigation against the raw source exposes them.
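One cheap way to catch this kind of distortion is to compare the recent average with the lifetime median and flag large divergences for inspection of the raw history. The sketch below assumes a simple ratio test; the 1.5 threshold is an arbitrary illustrative value, not a tuned parameter from the pipeline.

```python
import statistics

def average_looks_distorted(recent_prices, lifetime_prices, max_ratio=1.5):
    """Return (flagged, ratio), where ratio = recent average / lifetime median.

    A ratio far from 1 in either direction means the recent window is not
    representative and the raw price history deserves a look.
    """
    recent_avg = statistics.fmean(recent_prices)
    lifetime_median = statistics.median(lifetime_prices)
    if lifetime_median <= 0:
        raise ValueError("lifetime median must be positive")
    ratio = recent_avg / lifetime_median
    flagged = ratio > max_ratio or ratio < 1 / max_ratio
    return flagged, ratio
```

With a recent average of 87 against a lifetime median of 13.56, the ratio is above 6 and the listing is flagged; a product whose average and median sit within a few percent of each other passes quietly.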

This is also where the margin model must be documented as a single source of truth. When a product is excluded, the system should be able to explain why — explicitly. Was the margin below threshold at the floor? Was estimated demand too weak? Was Amazon dominating the Buy Box? Was the supplier cost missing? A good sourcing pipeline does not only produce a buy list. It produces a decision trail, and that trail matters because business rules change: a product excluded today under a conservative threshold may become interesting tomorrow if supplier terms improve. Exclusion should be explicit, documented and reversible.
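A decision trail of this kind can be sketched as a list of rule results recorded for every candidate, including the rules that passed. The field names and thresholds below are hypothetical, and the sketch assumes every field is present.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RuleResult:
    rule: str
    value: float
    threshold: float
    passed: bool

def evaluate(candidate: dict, thresholds: dict) -> list:
    """Run every gate and record the outcome, including the gates that passed."""
    checks = [
        ("margin_at_floor", candidate["margin_floor"],
         thresholds["min_margin"], lambda v, t: v >= t),
        ("estimated_monthly_sales", candidate["monthly_sales"],
         thresholds["min_sales"], lambda v, t: v >= t),
        ("amazon_buybox_share", candidate["amazon_share"],
         thresholds["max_amazon_share"], lambda v, t: v <= t),
    ]
    return [RuleResult(name, value, limit, ok(value, limit))
            for name, value, limit, ok in checks]

def explain(trail) -> str:
    """Turn a trail into the one-line explanation shown to the reviewer."""
    failed = [r for r in trail if not r.passed]
    if not failed:
        return "kept: all gates passed"
    return "excluded: " + "; ".join(
        f"{r.rule}={r.value} vs threshold {r.threshold}" for r in failed)
```

Because the thresholds are an input rather than constants buried in the code, re-running `evaluate` with looser supplier terms or a different margin floor is what makes an exclusion reversible in practice.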

Stage 4: using an LLM only where rules are weak

The most interesting use of AI in this pipeline is also the most limited.

Marketplace listings are noisy. A search for a pump reference may return the original manufacturer part, a compatible replacement, a kit that includes the part, an accessory, a refurbished item, or a completely unrelated product that happens to mention the code. Rules catch some of this — exact references, brand matching, category matching, title patterns all help — but they do not solve the whole problem.

A language model becomes useful when the task is not mathematical but semantic: reading a messy title and description and deciding what kind of match it is. The model classifies each candidate listing into a controlled set of labels — OEM part, aftermarket-compatible, uncertain — and must emit a confidence score and a one-line reason with every verdict.

Two real outputs from the same classification run show what this looks like in practice. A search for a battery-size code returned a listing for “10 x Renata 317 SR516SW lithium watch batteries”; the model classified it uncertain with confidence 0.80, reasoning that it was “a pack of watch batteries, not an appliance spare part.” In the same run, a “wpro USC100 external water filter” was classified aftermarket-compatible with confidence 0.85 — correct: a sellable product, but not the OEM part the reference pointed to. Over the full corpus of 9,915 classified listings, the distribution came out at 50.7% aftermarket-compatible, 39.0% OEM, and 10.3% uncertain. That last number is not a failure rate. It is the model doing its job: routing genuinely ambiguous listings to a human instead of guessing.
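The contract around each verdict can be enforced in ordinary code, independently of which model produces it. The sketch below assumes the model is asked to answer in JSON with `label`, `confidence` and `reason` keys; the key names, the label spellings and the 0.7 confidence cut-off are illustrative assumptions, and no model call is shown.

```python
import json

ALLOWED_LABELS = {"oem", "aftermarket_compatible", "uncertain"}

def parse_verdict(raw_json: str) -> dict:
    """Validate one model response; anything malformed degrades to 'uncertain'."""
    fallback = {"label": "uncertain", "confidence": 0.0,
                "reason": "unparseable model output"}
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    label = data.get("label")
    confidence = data.get("confidence")
    reason = data.get("reason")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return fallback
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return fallback
    if not 0 <= confidence <= 1:
        return fallback
    if not isinstance(reason, str) or not reason.strip():
        return fallback
    return {"label": label, "confidence": float(confidence), "reason": reason.strip()}

def route(verdict: dict, min_confidence: float = 0.7) -> str:
    """Uncertain or low-confidence verdicts go to a person, never straight on."""
    if verdict["label"] == "uncertain" or verdict["confidence"] < min_confidence:
        return "human_review"
    return "continue_funnel"
```

The design choice worth noting is the fallback: a response that cannot be validated is treated as uncertain rather than dropped, so a formatting failure ends up in the human queue instead of silently removing a listing.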

The obvious objection is: how do you validate the classifier itself? This article argues for validating demand estimates against ground truth (Stage 5), so the same skepticism must apply to the model. Three mechanisms contain it.

First, the mandatory uncertain class with confidence scores: the 10.3% of listings that land there go to a human review gate and are never acted on automatically. Second, the cost of a misclassification is asymmetric and capped by the funnel itself: a wrongly included listing still has to survive the margin and competition gates downstream, so a false positive costs one manual review, not a purchase. Third, spot-checks on samples of model verdicts during prompt tuning. An honest caveat belongs here: there is no large hand-labeled benchmark behind these numbers yet, and I would rather say so than imply one. The real safety net is the funnel design — the LLM filters; it never decides alone.

This is why the pattern works. The model is not asked to decide what to buy, run the business, replace the margin model or replace human review. It is a classification component inside a larger deterministic system: everything upstream and downstream remains ordinary code (acquisition, normalization, caching, margin computation, scoring, exclusion gates, reporting and review queues). The LLM is used like a scalpel, not like a general-purpose brain.

That containment is not only an economic choice; it follows from the sandbox problem: the model knows nothing about suppliers, costs or business rules except the listing text it is explicitly given, so the architecture should never ask it to reason beyond that injection. Scoping the model to the one task where its semantic judgment is genuinely needed is also the same logic as delegating work across model tiers: pay for intelligence only where the task actually requires it. It is the same token economy that governs an AI-assisted support workflow, where the strongest model is reserved for judgment and the cheaper one handles well-specified work.

And like everything else in the pipeline, its results are cached: once a listing has been classified, the verdict is stored and reused unless the underlying listing changes materially. There is no reason to pay repeatedly for the same judgment.
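Verdict caching can be sketched by keying each verdict on a fingerprint of the listing text. Treating case and whitespace changes as immaterial is my assumption about what "materially" means here, and the in-memory dict stands in for whatever persistent store is actually used.

```python
import hashlib

def listing_fingerprint(title: str, description: str) -> str:
    """Hash of the listing text, ignoring case and whitespace differences."""
    parts = [" ".join(text.lower().split()) for text in (title, description)]
    # The unit-separator character keeps title and description from blending.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

def classify_cached(title, description, cache: dict, classify):
    """Call `classify` (the paid model call) only for text not seen before."""
    key = listing_fingerprint(title, description)
    if key not in cache:
        cache[key] = classify(title, description)
    return cache[key]
```

A listing whose title gains the word "compatible" produces a new fingerprint and is classified again; a listing that only changed its capitalization is served from cache.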

Stage 5: validating demand against reality

Estimated marketplace demand is useful, but it is still an estimate, and the only way to understand its value is to compare it against reality.

In this pipeline, the external sales estimates were cross-checked against roughly one year of our own marketplace order history, on products where a direct comparison was possible. The goal was not to prove the estimate perfect. It was to measure its bias — and the bias turned out to be strongly segment-dependent. This is historical data acting as the real foundation of the AI system in the most literal sense: without our own order history as ground truth, the external estimate would have remained an act of faith.

In aggregate, the API’s sales estimate came out at roughly 0.81× reality: it under-counts total volume. But at the median product, the same estimate was roughly 1.8× reality: it over-counts. The two numbers are not a contradiction. They describe a distribution: the estimate systematically understates fast-moving products and systematically overstates the long tail. A headline correlation — “the estimate tracks our sales reasonably well overall” — would have hidden exactly the bias that matters most for sourcing decisions, because sourcing lives in the segments, not in the aggregate.
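The two ratios are easy to compute side by side, and a toy dataset shows how they can point in opposite directions. The numbers in the usage example are invented purely to illustrate the mechanism; they are not the project's data.

```python
import statistics

def bias_summary(pairs):
    """pairs: list of (estimated_units, actual_units), with actual_units > 0.

    Returns (aggregate_ratio, median_ratio):
      aggregate_ratio = total estimated / total actual (dominated by fast movers)
      median_ratio    = median of per-product estimated / actual (the typical product)
    """
    total_estimated = sum(est for est, _ in pairs)
    total_actual = sum(act for _, act in pairs)
    per_product = [est / act for est, act in pairs]
    return total_estimated / total_actual, statistics.median(per_product)

if __name__ == "__main__":
    # Invented example: one fast mover that is under-estimated,
    # four long-tail products that are over-estimated.
    sample = [(700, 1000), (4, 2), (6, 3), (9, 5), (2, 1)]
    aggregate, median = bias_summary(sample)
    print(f"aggregate ratio: {aggregate:.2f}, median ratio: {median:.2f}")
```

In the invented sample the aggregate ratio falls below 1 while the median ratio sits above 1, the same shape as the 0.81× and 1.8× figures: one high-volume product dominates the totals, while the typical product is over-estimated.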

Once the bias is measured, the estimate becomes usable. If it understates fast movers, the scoring model can adjust upward in that segment. If it overstates long-tail demand, the review threshold for long-tail candidates becomes stricter. Validation turns an external metric from a black box into a calibrated signal, and that is a recurring lesson in applied data systems: the usefulness of a signal depends less on whether it is perfect and more on whether its bias is understood.
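One possible way to apply a measured bias is a correction factor per segment, fitted on the products where both the estimate and real orders are known. This is a sketch of that idea, not the pipeline's method: it scales long-tail estimates down rather than tightening a review threshold, and the segment boundary of 50 units is an arbitrary assumption.

```python
from collections import defaultdict

def segment_of(estimated: float, fast_threshold: float) -> str:
    # Segment on the estimate, because that is all that is known at scoring time.
    return "fast" if estimated >= fast_threshold else "long_tail"

def fit_factors(history, fast_threshold: float = 50.0) -> dict:
    """history: list of (estimated_units, actual_units) pairs.
    Returns {segment: actual / estimated} computed on segment totals."""
    sums = defaultdict(lambda: [0.0, 0.0])
    for estimated, actual in history:
        bucket = sums[segment_of(estimated, fast_threshold)]
        bucket[0] += estimated
        bucket[1] += actual
    return {seg: act / est for seg, (est, act) in sums.items() if est > 0}

def calibrate(estimated: float, factors: dict, fast_threshold: float = 50.0) -> float:
    """Apply the segment factor; unknown segments are left uncorrected."""
    return estimated * factors.get(segment_of(estimated, fast_threshold), 1.0)
```

A factor above 1 for fast movers and below 1 for the long tail is the code-level expression of the bias described above; the factors are only as good as the overlap between estimated products and products with real order history.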

Stage 6 (experimental): competitor stock as a demand sensor

The newest layer of the pipeline is explicitly experimental, and I want to frame it that way rather than oversell it.

The idea is simple. If a competitor exposes available stock on a listing, polling that value over time may reveal movement: a day-over-day decrease can indicate sales; a stable number, low movement; a sudden increase, replenishment. Compared with abstract demand estimates, observed stock movement is closer to actual market behavior — though it can be distorted by feed updates, multi-channel sales, corrections and synchronization delays, so it should never be read as exact ground truth.

The sensor went live this week, and the first 24-hour delta is in the database. Of 65 tracked competitor offers, 3 moved between June 11 and June 12. On one part, a competitor’s stock went from 325 to 318 — seven units sold in a single day at an unchanged price. On another, stock dropped from 8 to 6 alongside a simultaneous €0.50 price cut, a small but legible competitive move. One day of data is a reading, not a validation; the layer needs weeks of deltas before it can earn a weight in the scoring model. If the pattern holds, this layer deserves a dedicated article of its own.
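Reading a daily delta is simple enough to sketch in a few lines. The labels below are hypothetical and deliberately worded as "possible", because a decrease can also come from feed corrections or sales on another channel.

```python
def classify_stock_delta(previous, current):
    """Compare two consecutive stock readings for one competitor offer.

    Returns (label, units), where units is the absolute size of the move,
    or None when one of the readings is missing.
    """
    if previous is None or current is None:
        return ("unknown", None)
    delta = current - previous
    if delta < 0:
        return ("possible_sales", -delta)
    if delta > 0:
        return ("possible_replenishment", delta)
    return ("no_movement", 0)
```

Applied to the readings above, 325 to 318 comes out as a possible sale of 7 units and 8 to 6 as a possible sale of 2 units; a missing reading stays "unknown" instead of being read as zero movement.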

What it already illustrates is the broader principle of the pipeline: the strongest sourcing decisions come from combining imperfect signals — marketplace price history, seller competition, supplier cost, internal sales logs, LLM classification, and now observed competitor movement. None of these signals is sufficient alone. Together, they reduce uncertainty enough to make human review efficient.

Design rules that hold the funnel together

Three rules cut across every stage. They are not stages themselves; they are constraints that the whole pipeline must respect.

Every long-running job must be incremental and idempotent. A pipeline that processes hundreds of thousands of records cannot depend on a perfect execution. SSH sessions drop, API rate limits are reached, servers restart, and unexpected responses appear. Each script must know what has already been done, skip completed records and continue from the remaining queue. Long jobs must be killable and restartable without wasting previous work. This is not elegant in theory; it is essential in practice.
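A minimal sketch of a restartable job, assuming record IDs are plain strings without whitespace and that progress is tracked in an append-only text file (an illustrative choice; a database table would serve the same purpose):

```python
from pathlib import Path

def run_incremental(ids, process, done_file: Path) -> int:
    """Process each id at most once across runs; return how many ran this time."""
    done = set()
    if done_file.exists():
        done = set(done_file.read_text(encoding="utf-8").split())
    processed = 0
    with done_file.open("a", encoding="utf-8") as log:
        for item_id in ids:
            if item_id in done:
                continue  # completed in an earlier run
            process(item_id)
            # Record success only after the work is finished, and flush so
            # a killed process loses at most the record in flight.
            log.write(item_id + "\n")
            log.flush()
            done.add(item_id)
            processed += 1
    return processed
```

If the process dies between `process(item_id)` and the log write, that one record is repeated on the next run, which is why the per-record work itself must also be safe to repeat.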

Different data decays at different speeds, so caches need different lifetimes. Product identity — a manufacturer reference, a brand, an EAN — is stable for weeks. Supplier prices and stock change in days; competitor stock, as Stage 6 shows, can change overnight. A single cache duration for everything is easier and usually wrong. The pipeline uses a dual-TTL approach: stable identity data is cached long, volatile commercial data is refreshed on a short cycle. This reduces unnecessary calls without making pricing decisions stale.
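The dual-TTL idea reduces to a freshness check that depends on the kind of data. The two lifetimes below (30 days and 1 day) are placeholder assumptions to show the shape, not the pipeline's actual settings.

```python
import time

# Placeholder lifetimes in seconds; tune per data source.
TTL_SECONDS = {
    "identity": 30 * 24 * 3600,   # part number, brand, EAN
    "commercial": 24 * 3600,      # prices, offers, stock
}

def is_fresh(kind: str, fetched_at: float, now=None) -> bool:
    """True if a cache entry of this kind can still be used without refetching."""
    now = time.time() if now is None else now
    return (now - fetched_at) < TTL_SECONDS[kind]
```

An entry fetched two days ago is still fresh as identity data and already stale as commercial data, which is the whole point of keeping two lifetimes instead of one.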

Missing data must never automatically eliminate a product. This is the most common source of hidden errors in automated decision systems: a product disappears from the shortlist not because it is bad, but because one field was missing, one API did not answer, one supplier did not provide an EAN. That is not a business decision; it is a data accident. The system must distinguish negative evidence from absent evidence. Poor margin, weak demand and heavy competition justify exclusion. A missing field justifies recovery, estimation, a later enrichment stage, or a human flag — never silent rejection. This matters most in long-tail catalogs, where many commercially interesting niche products are imperfectly represented in every data source. A good funnel does not only filter. It preserves optionality until there is a real reason to close the door.
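The distinction between negative and absent evidence can be made explicit with a three-state gate outcome instead of a boolean. The names below are hypothetical; the point is only that `None` never maps to exclusion.

```python
from enum import Enum

class Outcome(Enum):
    PASS = "pass"
    EXCLUDE = "exclude"        # negative evidence: the value is known and too low
    NEEDS_DATA = "needs_data"  # absent evidence: the value is missing

def gate_minimum(value, minimum) -> Outcome:
    """Gate on a lower bound; a missing value is never treated as a failure."""
    if value is None:
        return Outcome.NEEDS_DATA
    return Outcome.PASS if value >= minimum else Outcome.EXCLUDE

def combine(outcomes) -> Outcome:
    """Known negative evidence wins; otherwise any gap sends the product to enrichment."""
    outcomes = list(outcomes)
    if Outcome.EXCLUDE in outcomes:
        return Outcome.EXCLUDE
    if Outcome.NEEDS_DATA in outcomes:
        return Outcome.NEEDS_DATA
    return Outcome.PASS
```

A product with a healthy margin and no demand figure comes out as `NEEDS_DATA`, which routes it to recovery or a human flag; only a product with a measured, insufficient value is excluded.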

The actual result

The outcome of this architecture is not an automatic buying machine. That would be the wrong goal.

The result is a decision-support system that reduces an impossible review space to a manageable one. Instead of manually evaluating roughly 300,000 supplier references, the business focuses on a shortlist of candidates with verified marketplace presence, margins modeled against full price distributions, measured competition, calibrated demand indicators, classification labels with confidence scores, and a documented explanation of why each product survived the funnel — and why every other product did not.

The shape of the ending is deliberate: the pipeline prepares, the human approves. It is the same human-gate pattern, at a much larger scale, as the deployment queue that records what an AI assistant changed without granting it permission to ship — in both systems, the automated actor produces a structured, inspectable artifact, and the decision that carries real risk stays with the person who owns the consequences.

The most important lesson is that useful AI systems rarely begin with the model. They begin with the workflow. The LLM in this pipeline matters, but only in one stage — and that is exactly why it is useful. If the whole pipeline depended on the model, it would be expensive, opaque and difficult to control. If the pipeline ignored language ambiguity entirely, it would drown in false matches. The practical solution sits between those extremes: deterministic engineering for everything that can be made deterministic, and AI only where text ambiguity makes rules genuinely weak. It is the same conclusion reached in building a self-improving image recognition pipeline without training a model: the value comes from the structure around the model, not from asking the model to be the structure.

For me, this is the more serious direction for AI in e-commerce and operational systems. Not a chatbot attached to the side of the business, but a carefully placed component inside the decision processes that determine what the business should do. It removes noise, classifies ambiguity, accelerates repetitive analysis, and gives the human decision-maker a structured shortlist instead of a raw catalog.

The value is not in asking the model to be brilliant. The value is in designing a system where the model only has to be useful.