Elicit and Consensus: AI for Academic Research
Elicit and Consensus are AI tools built specifically for academic research — searching, reading, and synthesizing scientific papers. They occupy a niche that Google Scholar plus an LLM handles passably but not well, and the question is whether dedicated tools justify their existence. The short answer: for serious literature reviews, yes. For casual research, probably not. The long answer involves understanding what each tool actually does versus what the landing page implies.
What It Actually Does
Elicit and Consensus attack the same problem — "what does the research say about X" — from different angles.
Elicit is a research assistant that does semantic search over a corpus of academic papers and then extracts structured information from the results. You type a research question, and instead of matching keywords like Google Scholar, Elicit finds papers whose content is semantically relevant. "Does mindfulness meditation reduce cortisol levels?" returns papers that study that relationship, even if they don't use those exact words. Once you have results, the extraction features are where Elicit gets interesting. You can ask it to pull specific fields from each paper — sample size, methodology, key findings, limitations, effect sizes — and it builds a structured table across your results. Imagine doing a literature review where instead of reading 40 abstracts one at a time, you get a spreadsheet with the methodology and findings from each paper extracted automatically. That's the pitch, and in practice, it works about 75% of the time.
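To see the shape of that workflow, here's a minimal sketch of the general pattern, not Elicit's actual pipeline: prompt a model for a fixed set of fields per paper, parse the answer, and collect the rows into a table. The `call_llm` function is a hypothetical placeholder for whatever LLM client you use, and the field names and prompt are invented for illustration.

```python
# A minimal sketch of the extraction pattern, not Elicit's actual pipeline.
# `call_llm` is a hypothetical stand-in for your own LLM client.
import csv
import json

FIELDS = ["sample_size", "methodology", "key_findings", "limitations"]

PROMPT = (
    "Extract the following fields from this paper abstract as JSON with "
    f"keys {FIELDS}. Use null for anything not stated.\n\nAbstract:\n{{abstract}}"
)

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: swap in an actual model client."""
    raise NotImplementedError("wire up a real LLM here")

def extract_rows(papers: list[dict]) -> list[dict]:
    """Run the same schema over every paper and collect the rows."""
    rows = []
    for paper in papers:
        raw = call_llm(PROMPT.format(abstract=paper["abstract"]))
        record = json.loads(raw)  # trust nothing: parse, then verify by hand
        record["title"] = paper["title"]
        rows.append(record)
    return rows

def write_table(rows: list[dict], path: str = "review_table.csv") -> None:
    """Dump the extractions to a spreadsheet-friendly CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", *FIELDS])
        writer.writeheader()
        writer.writerows(rows)
```

The point of the sketch is the shape: one schema, many papers, one table. Everything downstream of that table is verification.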
Consensus takes a different approach. It's positioned as a search engine that gives you "evidence-based answers" by synthesizing findings across papers. You ask a question, and Consensus returns a set of relevant papers along with a synthesis — a paragraph summarizing what the research collectively says, sometimes with a "consensus meter" that shows the balance of evidence. Think of it as Perplexity restricted to peer-reviewed literature. The synthesis is the product. Instead of reading 20 abstracts yourself, you get a summary with citations and a directional indicator of scientific agreement.
Where both tools genuinely help is in the discovery phase of research. Semantic search finds papers that keyword search misses. If you're studying the relationship between sleep and cognitive performance, Google Scholar requires you to think of every relevant keyword combination — "sleep deprivation cognition," "insomnia executive function," "circadian rhythm memory consolidation." Elicit and Consensus understand the concept and find papers that address it using terminology you might not have thought of. For interdisciplinary topics where the same phenomenon gets studied under different names in different fields, this is a real advantage.
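For intuition, here's a toy version of embedding-based semantic search, assuming the open-source sentence-transformers library (the commercial tools use their own models and corpora). Embed the query and each abstract, rank by cosine similarity, and note that the best matches share almost none of the query's wording.

```python
# Toy semantic search over paper abstracts using sentence embeddings.
# Assumes `pip install sentence-transformers`; the model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

abstracts = [
    "Sleep restriction impairs working memory and sustained attention.",
    "Circadian disruption alters hippocampal memory consolidation in mice.",
    "Keyword stuffing harms search engine rankings.",
]

query = "Does sleep deprivation affect cognitive performance?"

# Embed the query and every abstract, then rank by cosine similarity.
query_vec = model.encode(query, convert_to_tensor=True)
doc_vecs = model.encode(abstracts, convert_to_tensor=True)
scores = util.cos_sim(query_vec, doc_vecs)[0]

# The relevant abstracts rank highest despite barely sharing the query's words.
for score, abstract in sorted(zip(scores.tolist(), abstracts), reverse=True):
    print(f"{score:.2f}  {abstract}")
```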
Elicit's structured extraction is the feature that saves the most time for serious researchers. A systematic literature review — the kind where you need to catalog the methods, populations, findings, and limitations of every relevant study — is weeks of manual work. Elicit compresses the initial extraction pass from hours per paper to seconds. You still need to read the papers. But instead of reading them to extract basic metadata, you read them to verify and deepen what the tool already pulled. The workflow shifts from "find and extract" to "verify and analyze." That's a meaningful change.
Consensus's consensus meter is the feature that gets the most attention and deserves the most skepticism. It presents a visual indicator — essentially a bar showing what percentage of papers agree with a proposition. "Does exercise reduce anxiety?" might show 87% yes. The appeal is obvious: a clear, quantified answer to a scientific question. The problem is also obvious, and we'll get there.
What The Demo Makes You Think
The demos for both tools lean heavily on the "get answers from science in seconds" framing. Consensus's marketing shows a question, a clear synthesis, and a consensus meter — science made simple. Elicit's demos show a question turning into a structured table of findings across 20 papers — literature review made automatic.
Here's where the demo diverges from reality.
The accuracy of AI paper summaries is good but not research-grade. When Elicit extracts "key findings" from a paper, it's summarizing what it interprets the paper to say. Most of the time, the summary matches the paper's actual conclusions. But "most of the time" is not good enough for research that other people will cite. I've seen Elicit extract a "finding" that the paper was actually arguing against — the AI had picked up the claim from the introduction's literature review, not from the paper's own results. This happens maybe 10-15% of the time on complex papers [VERIFY], and it happens more on papers where the discussion section is nuanced or where the authors are qualifying someone else's earlier work. The extraction is a first draft, not a finished product. If you treat it as the latter, you will cite papers for claims they don't actually make.
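One cheap guardrail, sketched below under two assumptions that may not hold for your setup (you can get a supporting quote for each extracted field, and you can split the full text into sections): flag any extraction whose quote lives in the introduction or related-work section, where papers restate other people's claims. It's a crude heuristic, not a substitute for reading.

```python
# Crude verification heuristic: flag any extracted "finding" whose supporting
# quote comes from the introduction or related-work section, where papers
# typically summarize *other* people's claims rather than report their own.
SUSPECT_SECTIONS = {"introduction", "background", "related work"}

def flag_suspect_extractions(extractions, sections):
    """extractions: list of {"field", "value", "quote"} dicts.
    sections: {"introduction": "...", "results": "...", ...} full-text map."""
    flagged = []
    for item in extractions:
        for name, text in sections.items():
            if item["quote"] and item["quote"] in text:
                if name.lower() in SUSPECT_SECTIONS:
                    flagged.append((item["field"], name))
                break
    return flagged  # anything here needs a human read before you cite it
```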
Consensus's consensus meter reduces complex debates to a percentage, and that's a feature and a trap. Science is not a vote. A consensus meter that shows "92% of papers support X" doesn't tell you that 3 of those papers had sample sizes under 20, that the one dissenting paper was a massive meta-analysis, or that the field has since moved toward a more nuanced position. The meter is a useful heuristic for "is this a controversial claim or a settled one," but it's actively misleading as a measure of scientific truth. The demo shows a clean number. Real science is messier than any number can capture.
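A toy calculation shows how far a paper-counting meter can drift from the evidence. The numbers below are invented: twelve small supporting studies and one large dissenting meta-analysis.

```python
# Toy illustration of why "92% of papers support X" can mislead: the same
# corpus gives very different answers when you weight by sample size.
# All numbers are invented for illustration.
papers = [
    # (supports_claim, sample_size)
    (True, 18), (True, 15), (True, 22), (True, 40),
    (True, 35), (True, 30), (True, 25), (True, 28),
    (True, 19), (True, 24), (True, 27), (True, 33),
    (False, 12000),  # one large meta-analysis disagrees
]

naive = sum(s for s, _ in papers) / len(papers)
weighted = sum(n for s, n in papers if s) / sum(n for _, n in papers)

print(f"naive consensus:  {naive:.0%}")     # 92% yes
print(f"sample-weighted:  {weighted:.0%}")  # 3% yes
```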
Both tools inherit the biases of their training corpus. They search over published, peer-reviewed literature — which means they inherit publication bias (positive results get published more than null results), English-language bias (most indexed papers are in English), and recency gaps (the most recent papers may not be indexed yet). If you ask "does intervention X work" and the published literature is skewed toward positive results because null results didn't get published, both tools will faithfully reflect that skew. Neither tool warns you about this. They present the available evidence, not the complete evidence — and the difference matters.
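A short simulation makes the skew concrete. Assume an intervention with zero true effect and a journal system that publishes positive results far more often than null ones; the publication rates are invented for illustration.

```python
# Tiny simulation of publication bias. The intervention has no real effect,
# so each study's measured effect is pure noise. If journals publish mostly
# positive results, the published record still looks like strong evidence.
import random

random.seed(0)
N_STUDIES = 200
PUBLISH_POSITIVE = 0.9   # 90% of positive results get published
PUBLISH_NULL = 0.2       # 20% of null/negative results do

published_positive = published_total = 0
for _ in range(N_STUDIES):
    effect = random.gauss(0.0, 1.0)  # true effect is zero
    positive = effect > 0
    p_publish = PUBLISH_POSITIVE if positive else PUBLISH_NULL
    if random.random() < p_publish:
        published_total += 1
        published_positive += positive

print("true positive rate:      50%")  # by construction
print(f"published positive rate: {published_positive / published_total:.0%}")
```

Any tool that searches only the published record inherits that inflated rate, no matter how good its synthesis is.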
The free tier versus paid tier split is real. Elicit's free tier gives you limited searches and extractions per month. The Pro tier — which is where the structured extraction and larger-scale features live — runs $10/month for academic users and more for professional use [VERIFY]. Consensus has a similar split, with the free tier capped on searches and the synthesis features gated behind payment. For a graduate student doing one literature review per semester, the free tier might suffice. For someone doing research as a regular part of their work, the paid tier is where the useful features live.
What's Coming (And Whether To Wait)
Both tools are under active development, and the trajectory for each points toward deeper extraction, better accuracy, and broader corpus coverage.
Elicit has been expanding its extraction capabilities — more field types, better handling of complex paper structures, and improved accuracy on the extraction itself. The roadmap likely includes support for supplementary materials (where a lot of actual data lives in modern papers), cross-paper synthesis features that go beyond simple extraction tables, and tighter integration with reference managers like Zotero [VERIFY]. The underlying models are getting better at understanding scientific text, which means extraction accuracy rises with that tide.
Consensus is pushing toward more domain-specific coverage and better synthesis quality. The consensus meter is getting more nuanced — factoring in study quality, sample size, and methodology rather than just counting papers [VERIFY]. This would address the biggest criticism of the current approach, though the fundamental tension between "quantify scientific agreement" and "represent scientific complexity" isn't going away.
The competitive landscape matters here. Google Scholar is not standing still — Google has the corpus, the search infrastructure, and the AI models to build something similar. Semantic Scholar from the Allen Institute for AI offers many of the same capabilities. And the general-purpose LLMs are getting better at understanding scientific papers when you paste them in directly. The dedicated tools have an advantage in specialization and UX, but that advantage narrows every time Claude or GPT gets a context window upgrade or a better approach to document analysis.
Should you wait? If you're actively doing research that involves literature reviews, no — these tools save real time today, even in their current state. If you're a casual user who occasionally wants to know "what does the science say," Perplexity or a general LLM probably gets you 80% of the way there without another subscription. The dedicated tools earn their keep through volume — the 50th paper you process through Elicit saves more time than the first, because you've built a workflow around the extraction features.
The Verdict
Elicit earns a slot for anyone doing systematic or semi-systematic literature reviews. The structured extraction — pulling methods, findings, and limitations across dozens of papers — is the killer feature, and nothing else does it as well. Treat the extractions as a first draft that needs verification, not as a finished product, and it will save you hours per review.
Consensus earns a slot for researchers who want a quick read on the balance of evidence on a question, and for journalists or writers who need to accurately characterize scientific agreement. The consensus meter is a useful starting point that becomes dangerous when treated as an endpoint. Use it to orient yourself, then read the papers.
Neither tool replaces reading papers. Both tools change the workflow from "find and read everything" to "find, triage, and deep-read the papers that matter most." For graduate students and professional researchers, that's a meaningful productivity gain. For everyone else, the general-purpose LLMs handle most "what does research say" questions well enough that the specialized tools are a luxury rather than a necessity.
The honest summary: Elicit and Consensus are the best tools for AI-assisted academic research, and they're still not good enough to trust without verification. That's not a knock — it's the nature of research. The tools that make you faster at finding and extracting information are valuable. The tools that make you think you don't need to verify what they found are dangerous. Use them as the former.
This is part of CustomClanker's Search & RAG series — reality checks on AI knowledge tools.