Plagiarism has exploded in the Covid-19 age. As more people worked from home and attended classes via Zoom, without direct in-person supervision, the temptation to co-opt someone else’s work has grown exponentially, as have ever more sophisticated ways to copy another person’s work.
Tricks such as replacing a letter like “o” with a similar-looking character from a non-Latin alphabet, or hiding “invisible” text highlighted in white, have become commonplace ways to outsmart current plagiarism detection programs.
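The letter-swapping trick can be illustrated with a minimal sketch. It assumes a simple heuristic: flag any word whose letters come from more than one script. The function names are hypothetical, and this is not how CopyLeaks or any specific product is known to work.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Rough script label: the first word of the character's Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def mixed_script_words(text: str) -> list[str]:
    """Return words whose letters come from more than one script (heuristic)."""
    flagged = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word if ch.isalpha()}
        if len(scripts) > 1:
            flagged.append(word)
    return flagged


# "\u043e" is the Cyrillic small letter o, which looks like the Latin "o".
sample = "c\u043epy this w\u043erk now"
suspicious = mixed_script_words(sample)
print(suspicious)
```

A plain string comparison treats the disguised words as different from their all-Latin originals, which is why a check like this would have to run before any text matching.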
The average rate of plagiarism rose from 26% before Covid to 45% after in the Netherlands, from 37% to 49% in France and from 42% to 53% in India, according to a survey of 51,000 college and high school students conducted by anti-plagiarism software maker CopyLeaks.
The solution is not more of the same – where software checks a database for copied words and paragraphs – but the use of artificial intelligence (AI) that doesn’t just compare words to words but also “meaning for meaning,” explains Alon Yamin, CEO of CopyLeaks.
The scrappy Israeli startup’s software is used by schools and organizations around the world, including Macmillan Publishers, Stanford University, the BBC, Medium, the National Space Society, the United Nations, Cisco and Accenture, as well as by students, bloggers and journalists.
CopyLeaks’ extensive client list reveals not only how widely its software can be used but how pervasive the plagiarism problem has become.
Schools may be the top use case for anti-plagiarism tools, but publications and book publishers can also use CopyLeaks to ensure their writers haven’t misappropriated – even accidentally – someone else’s work (journalists, for example, will often paraphrase text from another article, assuming they’ve made enough changes to make it their own; if not, the publication could be subject to legal action).
Misuse of content
Companies developing corporate websites are another source of potential clients for companies like CopyLeaks. Here the benefit is in reverse – has someone else copied your work?
That question is how CopyLeaks cofounder and CTO Yehonatan Bitton found his calling in the anti-plagiarism space.
In 2013, Bitton was developing content for a family-owned website when he found it was being copied by competing sites. The theft was frustrating, but even worse, these multiple sources of identical content were driving the site’s search rankings down, negatively impacting sales.
Bitton looked for a software solution to detect such misuse of content but couldn’t find any. He subsequently floated the idea of building something that could solve his problem to Yamin, his then work colleague and fellow graduate of the IDF’s 8200 signal intelligence unit.
Yamin was instrumental in developing AI and machine learning-powered algorithms for Israeli army intelligence; it was that technology that became the basis for CopyLeaks.
Some 70 million instances of copyright infringement were uncovered by CopyLeaks’ technology from 75 million pages scanned and 58 million documents compared.
CopyLeaks uses AI to understand a writer’s “voice.” That goes beyond just the words, where automated tools “can play with the text, change words and their order, making it easy to mask plagiarism,” Yamin tells ISRAEL21c.
“Even if not a single word is identical, we can detect if the meaning or the sentence structure is very much the same.”
That’s not beyond the ability of human readers, “but we can do it in an automated way at very high volume.”
And in a growing number of languages: CopyLeaks currently supports over 100 tongues, including Hebrew and Hindi.
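Why word-for-word matching is easy to evade can be shown with a rough sketch. It compares sets of words instead of exact strings, which already catches reordering but not synonym swaps; those are what a meaning-based model is meant to catch. This is only an illustration with hypothetical function names, not CopyLeaks’ method, which the company does not detail here.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased set of words, ignoring punctuation and word order."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of distinct words the two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


original = "The quick brown fox jumps over the lazy dog"
reordered = "Over the lazy dog the quick brown fox jumps"
reworded = "A fast brown fox leaps over a sleepy dog"

# Exact comparison misses the reordered copy; the word-set score does not.
print(original == reordered, jaccard(original, reordered))
# Swapping in synonyms lowers the score, so this simple measure can be fooled.
print(jaccard(original, reworded))
```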
CopyLeaks can help schools and publications prevent intentional or accidental copyright infringement, but it is also a way “to authenticate oneself, to make sure you’ve paraphrased enough, that you’ve attributed all your quotes properly. Our goal is to promote authenticity,” Yamin says.
The interface shows side-by-side comparisons of the original text on the left and the flagged text on the right, complete with links to the source it was lifted from. Reports can be downloaded as PDFs.
“A CopyLeaks scan [for plagiarism] can take from a few seconds to a few minutes depending on such factors as the size of the document or the number of results,” Yamin says.
On demand or always on
CopyLeaks can be used as a site license purchased by a school, institution or publication; by individual writers who pay based on the number of words and pages checked; or integrated into an existing LMS (learning management system).
The technology works with most of the top LMSs including Moodle, Blackboard, Canvas, Brightspace and Schoology – these cover some 90% of academic institutions. The software can be run on-demand (upload a file and click “scan”) or run constantly in the background.
Pricing runs from $10 a month for 1,200 pages (300,000 words) a year to $566 a month for 120,000 pages (30 million words) a year. Pricing for large institutions is customized to meet their specific needs. There is a free trial, too, where users can kick the tires for around 10 pages a month.
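The two quoted tiers imply a per-page cost and a words-per-page ratio. The short calculation below uses only the figures quoted above; the tier labels "entry" and "top" are made up for illustration.

```python
# (monthly price in USD, pages per year, words per year), as quoted above
tiers = {
    "entry": (10, 1_200, 300_000),
    "top": (566, 120_000, 30_000_000),
}


def summarize(monthly_usd: int, pages_per_year: int, words_per_year: int) -> dict:
    """Derive the annual cost, cost per page and implied words per page."""
    annual_usd = monthly_usd * 12
    return {
        "annual_usd": annual_usd,
        "usd_per_page": round(annual_usd / pages_per_year, 4),
        "words_per_page": words_per_year // pages_per_year,
    }


for label, figures in tiers.items():
    print(label, summarize(*figures))
```

Both tiers work out to 250 words per page, while the per-page price falls from about 10 cents to under 6 cents at the higher volume.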
CopyLeaks supports 25 file types including image files, where OCR (optical character recognition) algorithms ferret out any offending content. It can even scan computer code that programmers write as part of application development.
Customers can set how sensitive they want the software to be; there are six different levels. “Some customers only care about copy/paste type of plagiarism. So, the sensitivity will be very low. Others care about everything that could possibly be similar, so the sensitivity level will be very high. You can play with that and see what results are relevant for you in your use case,” Yamin says.
CopyLeaks recently introduced a new tool: grading written essays using AI.
“We ran a pilot with the Ministry of Education in Israel. We were just one point apart out of 100 points compared to human graders. It’s very accurate and fast – we can do it in just five minutes. And it’s completely unbiased,” says Yamin.
A global problem
CopyLeaks is not the only plagiarism detection tool keeping writers on their toes. The 800-pound gorilla in the space is Turnitin, which was acquired for $1.7 billion by Advance Publications in 2019.
Turnitin has, in turn, been busy acquiring smaller competitors, leading to a David vs. Goliath type of showdown for CopyLeaks, which has just 25 people in its two offices (Kiryat Shemona in Israel for R&D and Stamford, Connecticut for sales and marketing).
And while it’s far from the nearly $2 billion Turnitin fetched, CopyLeaks just raised a $6 million Series A round, on top of $1.8 million in 2018 from Connecticut Innovations (which is why its headquarters is in Stamford).
Yamin notes that CopyLeaks has more than 200,000 individuals who use it every month and another few hundred B2B (business-to-business) customers, such as publishers and schools.
What about the kinds of essay factories typically found at fraternities on university campuses? Will CopyLeaks put these out of business?
If you paid someone to write completely original content, that will be hard to detect, Yamin admits. But if the same student has also submitted an essay that he or she wrote independently, CopyLeaks can compare the “voice” of the two to see if it’s the same.
CopyLeaks is focused on text and images so far, but Yamin says scanning other media will come in the future, including copyrighted videos posted to file sharing sites.
Is there any geography that’s particularly egregious in copyright infringement? Yamin says no. “It’s really a global problem. It happens everywhere.”
How to catch plagiarized text
Software may be the best way to ferret out plagiarized text, but the human eye can still catch some of the most egregious lifting. Here are the main areas to monitor, according to CopyLeaks:
- Incoherence in writing style or sudden changes in writing patterns.
- Writing style variation from word to word or in different paragraphs.
- Content that does not relate to the given topic.
- References or sources that were not recommended in class.
- Drifts and shifts in subject matter.
- Different citation methods.
- Variation of font style and size between paragraphs.
- Multiple sources mentioned without any quotation.
- No quotes but extended cited sources.