Claude Code for Researchers: Synthesis Across a Folder of Sources

Last updated: June 10, 2026

A PM asks “what did users say about checkout?” and the researcher spends the next hour Ctrl+F-ing through seventeen transcripts. That afternoon is common enough that one UX researcher documented it almost word for word before rebuilding her whole process. Claude Code replaces that hour with one typed question, answered with verbatim quotes and file names.

This is a guide to Claude Code for researchers who work in words: interview transcripts, PDFs, field notes, and synthesis that has to survive peer review. No code required. I spent years as a PM doing transcript archaeology by hand, and this is the tool I wish someone had handed me.

Why Claude Code for Researchers Who Don’t Write Code

Claude Code is a chat that runs inside a folder on your computer, which turns out to be exactly the shape of qualitative analysis. Point it at a directory of transcripts and it reads every file, searches across them, and writes new ones: theme summaries, evidence tables, annotated bibliographies. The official docs have a whole section on working in notes and non-code folders, so this is supported use, not a hack. It starts read-only and can only write inside the folder you launched it from, per the security docs.

Three things make it better than pasting transcripts into a chat window:

  1. Your data stays put. Transcripts live as files in one folder. No uploading ten attachments per conversation.
  2. You can re-query. UX researcher Else van der Berg, who runs 50-transcript studies this way, put it plainly: “When my ‘listening for’ changes, I can re-query the data in minutes.”
  3. Outputs are files too, which means themes.md and evidence-table.csv exist tomorrow, survive a closed laptop, and can be emailed to a coauthor who has never heard of any of this.

One scoping note. If you are an economist who writes R and wants scripted regressions, Paul Goldsmith-Pinkham’s researcher setup series is the better guide. This page is for qualitative and mixed-methods researchers, UX researchers, and grad students whose raw data is language. You have company: 88% of UX researchers already use AI for analysis and synthesis, per an NN/g figure. The open problem is doing it without torching your integrity, which is why half this page is about verification.

If you have never touched a terminal, start with the complete guide for non-developers, then come back.

Claude Code Interview Analysis: Themes, Quotes, and Contradictions

The setup that makes everything else work is boring: one folder, one file per interview, consistent names (p01-maria.md, p02-devon.md). Van der Berg’s insight from her 50-transcript practice is that consistent structure is what makes a folder queryable. Inside a session, type @ to point Claude at any specific file, a habit the course teaches in working with files.

The interview synthesis pipeline

  1. Set up the folder. One file per interview, consistent names, plus a CLAUDE.md holding your research questions and the citation rule.
  2. Code in batches of 5 to 8. Ask for themes with a verbatim quote, file name, and location behind every claim.
  3. Hunt contradictions. Ask where participants disagree with each other, and with your own hypothesis.
  4. Run the verification pass. Claude searches each quoted string in its source file and flags anything it cannot find.
  5. Synthesize across batches. Merge batch summaries into themes.md and an evidence table a coauthor can audit.

A first-pass prompt that bakes in traceability:

Read the transcripts in this folder. For each theme you find, give me:
the theme, a one-paragraph summary, and 2 to 3 verbatim quotes with the
file name and approximate location of each. Only quote text that appears
word for word in the files.

Then go looking for trouble on purpose:

Where do participants contradict each other about pricing? Show both
sides with verbatim quotes and file names. Then list anything in these
transcripts that contradicts the hypothesis in CLAUDE.md.

That second prompt matters because models drift toward agreeable summaries. One researcher described the failure mode precisely: “AI confidently told me the main barrier was pricing, but the real signal was buried in how people talked about trust.” Claude does the mechanical reading. The interpretation, the part your name goes on, stays yours. Andrea Chiarelli, who automated thematic coding for a real evidence review, draws the same line: the machine handles the mechanical part of coding, the researcher keeps the intellectual part. His output format is worth stealing wholesale: a coded evidence table as CSV, a codebook, and a processing log, with “every quote traceable back to its source document.”
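If you want to see what “every quote traceable back to its source document” looks like as a file, here is a minimal Python sketch of an evidence table. The column names, the sample quote, and the `write_evidence_table` helper are all my own placeholders, not Chiarelli's format; Claude can produce the same CSV from a plain-English request.

```python
import csv
import io

# Hypothetical column names; adapt them to your own codebook.
FIELDS = ["theme", "quote", "file", "location"]


def write_evidence_table(rows, out):
    """Write one row per quote so every claim points back to a source file."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        # A row without a quote or a source file cannot be audited, so reject it.
        if not row.get("quote") or not row.get("file"):
            raise ValueError(f"untraceable row: {row}")
        writer.writerow(row)


# Invented example row, for illustration only.
rows = [
    {
        "theme": "trust",
        "quote": "I did not know who had my card number.",
        "file": "p01-maria.md",
        "location": "near line 40",
    },
]

buffer = io.StringIO()
write_evidence_table(rows, buffer)
print(buffer.getvalue())
```

The design choice worth copying is the refusal: a row that cannot name its file never reaches the table.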

Ten Transcripts at Once: Parallel Agents for First-Pass Coding

Do not dump 50 transcripts into one conversation. The context window fills, quality sags, and you cannot tell where. Practitioners converge on batches of 5 to 8 transcripts per pass, synthesized afterward across batches.
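The batching itself is simple enough to sketch. This Python snippet is only an illustration of the split, assuming seventeen hypothetical file names and a batch size of 6; the `make_batches` helper is mine, and you can get the same result by asking Claude to work through the folder a batch at a time.

```python
def make_batches(files, size=6):
    """Split a list of transcript names into consecutive batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    ordered = sorted(files)  # stable order, so batch 2 is the same files tomorrow
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Seventeen hypothetical transcripts: p01.md through p17.md.
transcripts = [f"p{n:02d}.md" for n in range(1, 18)]

for number, batch in enumerate(make_batches(transcripts), start=1):
    print(f"batch {number}: {len(batch)} files, {batch[0]} to {batch[-1]}")
```

With other file counts the last batch can fall below five. That is fine for coding, but note it in your processing log so the cross-batch synthesis knows which batch was thin.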

For bigger studies, subagents are the supported way to parallelize. Each one runs in its own context window, works concurrently, and reports back a summary, per the agent docs. Since they run at the same time, a batch of independent subtasks effectively finishes in the time of the slowest one. The course covers them hands-on in the agents lesson.

Launch 5 parallel agents. Each takes two transcripts from this folder,
codes them against the codebook in CLAUDE.md, and writes findings to
/coded as one file per transcript, every quote verbatim with a file
reference. When all agents finish, merge the results into themes.md.

First-pass coding, 10 transcripts (illustrative)

One conversation, one transcript at a time: ~100 min
Five parallel subagents, two transcripts each: ~25 min

Illustrative timing, not a benchmark. Subagents run concurrently per the Claude Code agent docs, so a batch effectively finishes in the time of the slowest task. Verification takes the same time either way.
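The arithmetic behind those two figures is sum versus max. In this Python sketch the per-transcript minutes are invented, chosen only so the totals match the illustrative 100 and 25 above; it ignores agent startup and the final merge.

```python
# Made-up minutes per transcript, picked to reproduce the illustrative figures.
minutes = [10, 10, 8, 10, 12, 13, 9, 9, 10, 9]

# One conversation, one transcript at a time: the times add up.
sequential = sum(minutes)

# Five agents, two transcripts each: every agent works through its own pair,
# and the whole batch is done when the slowest agent finishes.
per_agent = [sum(minutes[i:i + 2]) for i in range(0, len(minutes), 2)]
parallel = max(per_agent)

print("sequential:", sequential, "min")
print("per agent: ", per_agent)
print("parallel:  ", parallel, "min")
```

The slowest pair sets the clock, so putting two long interviews on the same agent costs you more than any other assignment mistake.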

On my first parallel run, I skipped the codebook, and five agents handed back five incompatible theme lists with five different names for the same complaint. Pin the codebook before you launch; agents cannot compare notes mid-flight. The published anecdotes still point in the right direction: Chiarelli describes several days of manual coding compressing into an afternoon of review. If you run studies every week, define a reusable transcript-coder in .claude/agents/ (the custom subagents lesson shows how) so the instructions stay identical across projects.

The verification pass does not parallelize away. Budget for it.

Literature Review: PDFs In, Annotated Bibliography Out

A Claude Code literature review is the same folder workflow with PDFs, and for grad students it is usually the first place Claude Code earns its keep in academic research. Drop the papers into /papers and ask for structure:

For each PDF in /papers, write an entry in bibliography.md with: full
citation taken from the PDF itself, research question, method and
sample, key findings, limitations, and relevance to the research
questions in CLAUDE.md. Only include papers that exist in this folder.

That last sentence is the integrity rule making an early appearance. More on it below, because it is the difference between a tool and a liability.

Mechanically, the Read tool takes short PDFs whole and reads anything over 10 pages in ranges, up to 20 pages per request. Claude handles the paging itself; you only notice on a 60-page paper, which takes several reads. Scanned PDFs are images of text, so run them through OCR first.
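To make “several reads” concrete, here is the page arithmetic as a Python sketch. It assumes only the 20-page ceiling stated above; the `page_ranges` helper is hypothetical, and how Claude actually chooses its ranges may differ.

```python
def page_ranges(total_pages, max_per_read=20):
    """Split a PDF's pages into consecutive 1-indexed, inclusive (first, last) ranges."""
    return [
        (start, min(start + max_per_read - 1, total_pages))
        for start in range(1, total_pages + 1, max_per_read)
    ]


# A 60-page paper at up to 20 pages per request.
for first, last in page_ranges(60):
    print(f"read pages {first} to {last}")
```

Three reads for a 60-page paper. The practical consequence is that a question about “the whole paper” spans several requests, so ask for the section you need when you already know where it is.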

Getting Tables Out of PDFs

Complex PDF tables garble when extracted as text. The fix, settled in a long-running GitHub issue, is to treat table extraction as a vision problem: convert the page to an image and let Claude read it with its eyes.

pdftoppm -png -f 14 -l 14 -r 150 paper.pdf table

Then: “Read @table-14.png and rebuild Table 2 as a CSV.” Claude Code can run that conversion command for you if you ask, which means in practice you just say “page 14 has a table, extract it.” Once the numbers are in a CSV, the data analysis guide covers everything you might do next.

Your Research Journal Lives in CLAUDE.md

CLAUDE.md is a plain text file Claude reads at the start of every session, which makes it a research journal that reads itself. Run /init to generate one, keep it under roughly 200 lines, and fill it with the things you currently re-explain every session: the project in one line, your research questions, the codebook, a decisions log, and your integrity rules.

The codebook entry earns its keep fastest. Remember the five incompatible theme lists from the parallel run? Same disease across sessions, and Chiarelli has a name for it: codebook drift, where an AI quietly redefines what “switching cost” means and your coded data stops being comparable. A pinned codebook that loads automatically every session is the cheap fix. Type /memory anytime to see what Claude has loaded, and note that recent versions also keep their own running notes between sessions. The project memory lesson covers the whole system in fifteen minutes.

Pointing Claude Code at an Obsidian Vault

An Obsidian vault is already a folder of markdown files, so Claude Code works in it with zero conversion.

cd ~/Documents/MyVault
claude

The single most useful step, per community consensus, is a CLAUDE.md at the vault root that says this is an Obsidian vault and that [[wikilinks]] are valid link syntax. With that in place, “summarize my reading notes on grounded theory and link the summary to the relevant source notes” produces a properly linked note instead of an orphan. If your vault is also where drafts happen, the writers guide covers that side. Above roughly 2,000 notes, add ignore patterns so Claude skips your daily-notes graveyard. A community MCP plugin adds backlink-aware and tag-aware operations, and you will not need it in week one.

The Integrity Rule: Claude Only Cites Files in the Folder

The top-voted reply when r/UXResearch discussed AI analysis tools: “highly prone to hallucinations, faking data… we used to call that a dry lab.” The skeptics are right about the failure mode, so build the workflow around it.

There are two distinct failures to defend against. First, invented citations: ask a model for sources from memory and fabrication rates run 14.23% to 94.93% across 13 models, per the 2026 GhostCite study. An audit of 111 million references found roughly 146,932 hallucinated citations in 2025 papers alone, and even NeurIPS 2025 accepted 53 papers containing 100 fabricated citations, each paper having passed three to five expert reviewers. Second, and subtler, regenerated quotes: in one Learning Analytics audit, 7.7% of quotes presented as verbatim could not be found in the source transcripts. Most were lightly smoothed, with filler words dropped and phrasing tidied, which is worse than fiction because it sails through a skim.

The numbers at a glance:

  1. 14 to 95%: citation fabrication when models cite from memory (13 models, GhostCite 2026).
  2. ~147,000: hallucinated citations found in 2025 papers (audit of 111M references).
  3. 7.7%: AI-supplied verbatim quotes not found in source transcripts (Learning Analytics audit).

Sources: GhostCite 2026 (arXiv); audit of 111M references (arXiv); Learning Analytics quote audit via UIntent.

The first time I asked Claude for “the strongest quote about onboarding,” it handed me a beautiful sentence that did not exist: three real fragments, one tidy splice. That cured me of treating verification as polish.

So the rule, stated once and enforced everywhere: Claude may only cite files that exist in the folder. Write it into your CLAUDE.md in the bluntest language you can manage; for literature work, that means a line like “NEVER generate a citation not present in references.bib.” The full pattern:

## Integrity rules
- Only cite files that exist in this folder. Never cite from memory.
- Every quote must be verbatim, with file name and location.
- If unsure a quote is exact, say so instead of smoothing it.

Then end every analysis with the verification pass:

For every quote in themes.md, search its source file for the exact
string. Output a table: quote, file, found verbatim yes or no. Flag
anything not found, including close paraphrases.
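If you would rather check the checker, the core of that pass fits in a few lines of Python. This is a sketch under my own assumptions, not what Claude runs internally: quotes arrive as (quote, file name) pairs, matching is case-sensitive, and runs of whitespace are collapsed so a line wrap in the transcript does not hide a real match. The `normalize` and `verify_quotes` names and the sample transcript are invented.

```python
import re
import tempfile
from pathlib import Path


def normalize(text):
    """Collapse runs of whitespace so line wraps do not break an exact match."""
    return re.sub(r"\s+", " ", text).strip()


def verify_quotes(rows, folder):
    """Return one result per (quote, file name) pair: found verbatim, or not."""
    results = []
    for quote, file_name in rows:
        path = Path(folder) / file_name
        if not path.is_file():
            status = "file missing"
        elif normalize(quote) in normalize(path.read_text(encoding="utf-8")):
            status = "found"
        else:
            status = "NOT FOUND"
        results.append({"quote": quote, "file": file_name, "status": status})
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # Invented transcript text with a line wrap in the middle of a sentence.
        Path(folder, "p01.md").write_text(
            "Honestly, um, I just\ndid not trust the page.", encoding="utf-8"
        )
        rows = [
            ("I just did not trust the page.", "p01.md"),  # verbatim across the wrap
            ("I did not trust the page.", "p01.md"),       # smoothed: "just" dropped
            ("Pricing was the barrier.", "p99.md"),        # cites a file not in the folder
        ]
        for result in verify_quotes(rows, folder):
            print(f'{result["status"]:12} {result["file"]:8} {result["quote"]}')
```

The second row is the one that matters: a single dropped word turns “found” into “NOT FOUND,” which is exactly the smoothing a skim misses. One caveat with this sketch is that curly versus straight quotation marks will also register as a mismatch, so treat every flag as a prompt to look at the source, not as a verdict.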

Spot-check the flags yourself against raw transcripts. When someone on Hacker News asked whether AI tooling in a PhD literature review is dishonest, the consensus answer was that transparency converts the question into method: document the pipeline and the verification step in your methods section. My own unhedged version: an AI research workflow without a verification pass is a dry lab with better branding, and you should not publish from one.

IRB and Privacy: What Actually Leaves Your Laptop

Your files live on your machine, and the contents of any file Claude reads are sent to Anthropic’s servers. Both halves of that sentence are true, and the second one is what your consent forms care about. Claude Code runs locally and transmits prompts, outputs, and file contents it reads over TLS; anyone who tells you “everything stays local” is wrong in the way that matters.

As of June 2026, the retention picture works like this, per the data usage docs and Anthropic’s consumer terms update. On Pro and Max, your data trains models only if the model-improvement setting is on: on means five-year retention, off means 30 days. Flip it at claude.ai under Settings, then Privacy. Team, Enterprise, and API accounts are never trained on by default, hold data 30 days, and qualified Enterprise accounts can request Zero Data Retention. One local detail people miss: your session transcripts sit in plaintext under ~/.claude/projects/ for 30 days by default, so a stolen unencrypted laptop is part of your threat model too.

For sensitive transcripts, three practical rules:

  1. De-identify before the folder. Pseudonyms, stripped employers and place names, done before Claude ever reads a file.
  2. Ask your IRB, because they are already asking about you. Michigan added mandatory AI-use questions to applications in May 2026, Northeastern’s September 2025 form explicitly covers AI-assisted analysis of interview data, and Pitt requires IT pre-approval of generative AI tools for research data.
  3. Check your consent language. If participants consented before AI analysis was in your protocol, do not assume it stretches to cover this. One qualitative researcher named the real stake in a study of AI attitudes in the field: “If people found out that I was feeding data into an LLM, I think that can cause a breakdown of trust.” Participant trust is the whole asset. Protect it like data.
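For the first rule, the mechanical part of pseudonymizing can be scripted before any file enters the folder. This Python sketch assumes you already hold a key mapping real names to pseudonyms; the names, the mapping, and the `deidentify` helper are invented for illustration. It only catches strings you list, so indirect identifiers (a job title, a small town, a distinctive story) still need a human read.

```python
import re

# Hypothetical key; build yours from the study's participant list and keep it
# outside the folder Claude reads.
PSEUDONYMS = {
    "Maria Lopez": "P01",
    "Maria": "P01",
    "Acme Corp": "[employer]",
}


def deidentify(text, mapping=PSEUDONYMS):
    """Replace each listed name with its pseudonym, whole words only."""
    # Longest names first, so "Maria Lopez" is replaced before bare "Maria".
    for name in sorted(mapping, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        replacement = mapping[name]
        # A function replacement avoids any backslash handling in the pseudonym.
        text = pattern.sub(lambda _match, r=replacement: r, text)
    return text


sample = "Maria Lopez said Acme Corp froze her account. Maria called twice."
print(deidentify(sample))
```

Run it on copies, then search the output for the original names yourself before the files go anywhere near an analysis folder.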

If your lab is choosing a plan, the limits and pricing guide translates the official pricing page into plain English, including which tiers carry the no-training default.

Learn It Inside Claude Code, Free

Everything above assumes you can start Claude Code, reference files, and run agents. My free course teaches exactly that, and the lessons run inside Claude Code itself: you type /start-1-1 and Claude walks you through real exercises with real files. Install takes about 15 minutes, the course download takes two, and Module 1 covers files, project memory, and the parallel agents from this guide. I built it for people who have never opened a terminal on purpose, and it stays free.

FAQ

Will Claude make up quotes from my transcripts?

It can. In one audit, 7.7% of quotes an LLM presented as verbatim could not be found in the source transcripts, usually because filler words were dropped or sentences were lightly reworded. Require a file name and location with every quote, then run a verification pass that searches each quoted string in its source file.

Is my interview data used to train Claude?

On Pro and Max plans, only if your model-improvement setting is turned on. With it off, conversations are retained for 30 days instead of five years, and Team, Enterprise, and API accounts are never trained on by default. Whatever the plan, the contents of files Claude reads are sent to the API, so de-identify sensitive transcripts before analysis.

Can Claude Code read PDFs?

Yes. Short PDFs are read whole, and longer ones are read in ranges of up to 20 pages per request. Scanned PDFs need OCR first, and complex tables come out cleanest if you convert the page to an image and let Claude read it visually.

Does Claude Code replace NVivo, ATLAS.ti, or MAXQDA?

No. CAQDAS tools organize coding you do by hand, while Claude Code automates the mechanical reading and first-pass coding that feeds them. Many researchers use both: Claude Code for the fast first pass, their CAQDAS tool for the audit trail.

Do I need to know how to code to use Claude Code for research?

No. You work in plain English, and the official docs explicitly support folders of notes and documents rather than code. If you can name a folder and describe what you want, you already have the prerequisite skills.