Helping AI Make Sense of the Knowledge Dump
I got tired of copy-pasting internal docs into AI conversations. So I built a CLI tool to scrape them into markdown—then discovered the real problem was teaching the agent how to make sense of it all.
This whole thing started because I got tired of copy-pasting.
I was working with an AI agent and kept running into the same friction: the information it needed lived behind logins, in internal wikis, private Slack channels, and meeting recording platforms. Every time I wanted the agent to understand something about our systems, I'd open a browser tab, copy the content, paste it into a file, and tell the agent to read it. It worked, but it was tedious, and the moment that content changed upstream, my local copy was already wrong.
So I built a CLI tool that could authenticate against our internal systems, scrape the content, and convert everything into markdown files the agent could read directly. Confluence pages, Slack channels, meeting transcripts, internal websites, all pulled down into a local directory of clean markdown. The idea was simple: give the agent a folder full of organizational knowledge and let it read what it needs.
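To make the workflow concrete, a session with the tool might look something like the sketch below. The command name, subcommands, and flags are invented for illustration; they aren't the actual tool's interface.

```bash
# Hypothetical CLI: register sources once, then pull them down as markdown
orgdump add confluence --space ENG --out knowledge/confluence/
orgdump add slack --channel team-notifications --since 90d --out knowledge/slack/
orgdump add meetings --workspace engineering --out knowledge/meetings/

# Scrape everything into the local knowledge base
orgdump sync
```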
That part worked. The "let it figure things out" part didn't.
Access isn't understanding
The agent treated a casual Slack message from two years ago with the same weight as our current architecture documentation. It grabbed something someone said while thinking out loud in a meeting and presented it as an established decision. It read an onboarding guide that described how a system was supposed to work and completely missed the three Slack threads from last month explaining how it actually works now.
The problem wasn't that the agent lacked information; it was that it had no sense of what was authoritative, what was stale, what contradicted what, and in what order things should be consumed. Having access to everything and understanding none of it turned out to be worse than having access to nothing, because at least with nothing the agent knows it doesn't know.
Scraping turned out to be the easy part
The CLI tool worked well enough. It remembered credentials, could authenticate against different systems, and pulled everything into clean markdown. I added the ability to re-run the same scrape to get fresh content, so periodic syncing meant the knowledge base stayed reasonably current with a single command.
Something that surprised me was that AI agents are already really good at using CLI tools on their own. I didn't need to build an MCP server or any kind of fancy integration because the agent could just run the CLI itself to pull fresh content whenever it needed to. The filesystem was the entire interface, and that turned out to be enough.
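In practice that looks like the agent running a couple of ordinary shell commands, sketched here with the same hypothetical command name and illustrative paths as above.

```bash
# Refresh every configured source before digging in
orgdump sync

# The filesystem is the whole interface from here
ls knowledge/
cat knowledge/confluence/architecture/overview.md
```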
But the real problem, the one I'd set out to solve by building the knowledge base in the first place, was still there. The agent had access to all the right information but had no idea how to make sense of it.
AGENTS.md
The solution was a file I called AGENTS.md, which was not a system prompt or a configuration file but a set of navigation instructions that taught the agent how to build its understanding from the sources available to it.
The key realization was obvious in hindsight: the agent needs to learn the same way a person would. Foundations first, then details, then recent changes layered on top.
The instructions told the agent to start with architecture docs and design documents to establish the foundational mental models. Then PRDs, so it understood what we were building and why. Then recorded onboarding sessions via their transcripts, which gave it essentially the same ramp-up that a new team member would get.
Only after building that foundation should the agent layer on information from other sources like Slack conversations from recent months to understand what's changed, meeting transcripts to catch decisions that haven't made it into formal docs yet, and recent internal documentation that reflects the current state of things.
The ordering matters more than you'd think. You can't make sense of a Slack thread about a migration problem if you don't understand the system architecture, and you can't evaluate whether a meeting transcript captures a final decision or just a brainstorm if you don't know what the PRD says. You need the foundation before the deltas mean anything.
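In the file itself, that progression reads roughly like the condensed excerpt below. The section wording and paths are illustrative, not quoted from the real file.

```markdown
## Reading order

When building understanding of any system or topic, read in this order:

1. Architecture and design docs (knowledge/confluence/architecture/) to establish the mental model
2. PRDs (knowledge/confluence/prds/) for what we're building and why
3. Onboarding session transcripts (knowledge/meetings/onboarding/) for the ramp-up a new team member gets
4. Recent Slack threads and meeting transcripts for what has changed since the docs were written

Do not start with Slack or meeting content. Without the foundation, the recent discussion is noise.
```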
What goes in the navigation instructions
The AGENTS.md file covers a few things that together give the agent a framework for navigating the knowledge base (a condensed excerpt follows the list):
What each source type is good for. Architecture docs describe intended system design, Slack threads capture real-time problem solving and informal decisions, meeting transcripts record discussions and the decisions that came out of them, and internal docs cover processes and current state. Each source has different signal density and different reliability, and the agent needs to understand those differences.
Where to start. For any given topic the instructions define a progression: architecture and design docs first, then PRDs, then onboarding material, then internal process docs, then recent Slack and meeting content. This isn't a rigid rule, just enough guidance to keep the agent from building understanding backwards.
How to handle contradictions. When a Slack thread from last week says one thing and an architecture doc from last year says another, the right answer depends on the type of claim. For how a system actually behaves right now, recent Slack is probably more accurate. For design intent and the reasoning behind decisions, the architecture doc likely still matters. The instructions help the agent make those judgment calls instead of treating everything as equally true.
What to treat as fact versus opinion. Someone saying "I think we should switch to Kafka" in a meeting is very different from an ADR that says "we are switching to Kafka," and the agent needs to distinguish between claims, opinions, decisions, and established facts. The navigation instructions make those distinctions explicit.
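Folded into the file, those rules come out to something like this excerpt. Again, the wording is illustrative rather than lifted from the real file.

```markdown
## Source guide

- Architecture docs: intended design and boundaries. Authoritative for intent; may lag current reality.
- Slack threads: real-time problem solving and informal decisions. Best signal for current behavior; weak on design intent.
- Meeting transcripts: discussions and the decisions that came out of them. Treat statements as opinions unless confirmed elsewhere.
- Internal docs: processes and current state. Check how recently they were updated.

## Conflicts and claims

- For how a system behaves right now, prefer the most recent Slack or meeting content.
- For design intent and the reasoning behind decisions, prefer architecture and design docs, even older ones.
- "I think we should switch to Kafka" in a meeting is an opinion. "We are switching to Kafka" in an ADR is a decision. Never report opinions, proposals, or thinking-out-loud as established decisions.
```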
What this looks like in practice
Say the agent needs to understand our notification system to help debug an issue.
Without navigation instructions it would search every markdown file for "notification" and stuff whatever it finds into its context window, mixing random Slack messages with architecture docs and meeting notes with no sense of what matters or what's current.
With the instructions, it starts with the architecture doc that describes the notification service's role and boundaries, which gives it the mental model of what the service does, how it connects to other services, and what the design constraints are. Then it reads the relevant PRD, so it knows not just what the system does but why it exists and what product requirements drive it. Then it reads recent Slack threads from the relevant channel, where the ground truth lives: the recent bugs, the workarounds people are using, and the changes that shipped but nobody documented yet.
Each layer builds on the previous one. The architecture doc gives the agent the vocabulary to understand the Slack threads, the PRD gives it context to evaluate whether a recent change is intentional or a workaround, and the Slack threads tell it what's actually happening right now that the formal docs haven't caught up to.
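Traced as file reads, that layered pass might look like the sketch below; the paths are hypothetical and only meant to show the order.

```bash
# 1. Foundation: the architecture doc for the notification service
cat knowledge/confluence/architecture/notification-service.md

# 2. Intent: the PRD behind it
cat knowledge/confluence/prds/notifications.md

# 3. Deltas: recent Slack threads that mention notifications
grep -ril "notification" knowledge/slack/team-notifications/ | sort | xargs cat
```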
The context window isn't just a bucket to fill. It's a workspace, and what occupies it, in what order, determines the quality of reasoning that happens there.
What I've learned
Markdown is the right format. Agents parse markdown natively, so converting everything to markdown files in a directory structure is the lowest friction path to making organizational knowledge accessible without needing databases, retrieval APIs, or special tooling.
CLI tools are underrated as an interface. Agents are already good at running commands, so a CLI that can authenticate, scrape, and sync is something the agent can operate on its own without elaborate integrations. A good CLI and the filesystem get you surprisingly far.
The instructions matter more than the content. Having all the right documents is necessary but not sufficient. The AGENTS.md that tells the agent how to learn from those documents is what turns a pile of markdown files into a usable knowledge base.
Foundations first, deltas second. The agent needs to build understanding the way people do by establishing the baseline from stable, authoritative sources and then layering on recent changes from more volatile ones. Without the foundation the deltas are noise, and without the deltas the foundation goes stale.
Syncing solves staleness. The ability to re-scrape all sources with a single command means the knowledge base reflects current reality instead of a snapshot from when you first set it up, and if refreshing is easy you'll actually do it.
Where I've landed
My current setup has three pieces. The first is the CLI scraper, which authenticates against internal systems and pulls content into a directory of markdown files, remembering credentials and source configurations so re-syncing is a single command. The second is the knowledge base itself, which is just organized markdown files from internal docs, Slack channels, meeting transcripts, and websites sitting on disk. The third is AGENTS.md, the navigation instructions that teach the agent to build understanding in layers: architecture and design docs first, then PRDs and onboarding material, then recent Slack and meeting content for the latest state of things.
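On disk, the whole thing is nothing fancier than a directory tree along these lines (the layout is illustrative):

```
knowledge/
├── AGENTS.md                  # navigation instructions
├── confluence/
│   ├── architecture/
│   └── prds/
├── slack/
│   └── team-notifications/
├── meetings/
│   └── onboarding/
└── web/
```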
The agents I work with now don't know everything about our organization. But they know how to learn what they need, in the right order, from the right sources. That turned out to be the whole game.