Meta Has Reportedly Trained Its AI With Pirated Books Despite Warnings From Lawyers

The information comes from the recent complaint, which incorporates chat logs from a researcher at Meta.

Image Credit: Ascannio, Shutterstock

As if there weren't enough reasons to dislike Meta and its business practices, it has now emerged that the company was warned in advance by its lawyers about the dangers of using copyrighted materials to train its AI models but, allegedly, did so anyway.

The allegation appears in a recent filing in a copyright infringement lawsuit initially filed this summer. The filing consolidates two legal actions brought against Meta by comedian Sarah Silverman, Pulitzer Prize winner Michael Chabon, and other notable authors, who allege that the company used their works without permission to train Llama, a 65-billion-parameter large language model developed by Meta.

The updated complaint, Reuters reports, now includes chat logs featuring Meta-affiliated researcher Tim Dettmers. The logs capture discussions between Dettmers and Meta's legal team about the legality of acquiring a specific dataset of book files within a Discord server and whether doing so would be "legally ok". They are seen as potentially significant evidence, suggesting that Meta may have been aware that US copyright law might not protect its use of the books.

"At Facebook, there are a lot of people interested in working with (T)he (P)ile, including myself, but in its current form, we are unable to use it for legal reasons," Dettmers wrote in 2021, according to the filing.

For those unfamiliar, Meta's paper describing LLaMA says the model's training datasets include The Pile, which is "a copy of the contents of the Bibliotik private tracker", making it "flagrantly illegal".

The complaint further alleges that a month prior, Dettmers had stated that Meta's legal team informed him that "the data cannot be used or models cannot be published if they are trained on that data".

As stated in the report, although Dettmers doesn't detail the lawyers' concerns, his colleagues in the chat point to "books with active copyrights" as the likely main source of worry, arguing that training AI models on such data should "fall under fair use".

So far, neither Dettmers nor Meta has commented on the situation. It is also unclear whether Meta's upcoming competitor to GPT-4, reportedly in development, will be trained on pirated books.

