Summary
OpenAI has increased its number of paying enterprise customers from 2 million to 3 million in just 4 months. The increase reflects the company’s attractiveness as an “AI-native” company, rather than one that integrates AI into existing products, and could also indicate a greater readiness among corporate users to integrate AI into production environments. Meanwhile, Reddit is suing Anthropic for allegedly using data from its discussion groups to train AI models without permission or compensation.
Microsoft’s carbon emissions are up 23.4% since 2020, largely due to the increased need for data centers. However, 97% of the company’s 2024 carbon footprint consisted of Scope 3 emissions – emissions considered to be outside a company’s direct control, such as those linked to transportation, to raw materials in chips and hardware, and to the steel and concrete used to build data centers. In an Irish Times article, Meta’s chief privacy officer is largely critical of European legislation, claiming that it reflects an effort by “a vocal minority of activist groups to delay AI innovation”. The EU has recently fined Meta 200 million EUR under the Digital Markets Act, accusing the company of “using misleading practices and unclear terms to steer users” to the option of having their personal data processed, degrading the service for those who do not.
BOND, a global technology investment firm, has published a data-rich presentation on the current and predicted status of AI. The report notes that the cost of model training is increasing while inference costs are falling, and that the rise of China and of open-source models are risks to the monetization of AI. Elsewhere, an MIT Technology Review article presents several cases of language model hallucination in documents filed in court. The real risk is that, at some point, a judge will make a decision based on evidence that has been partly hallucinated by AI.
As part of its campaign to reduce federal funding, the Trump administration has canceled more than 100 research projects relating to climate change. The administration has come up with a new science budget proposal that entails a 60% cut in overall research spending, with the near elimination of clean energy technology research.
New research measures large language model memorization – the degree to which a model retains data points seen in training. Using an information-theoretic approach, the researchers find that GPT-style transformers can memorize between 3.5 and 4 bits of information per parameter. The researchers postulate that modern language models are trained on too much data for it to be possible to test whether a specific data point has been memorized from the training data – or simply inferred by the model.
Table of Contents
1. How AI is introducing errors into courtrooms
2. Trends – Artificial Intelligence – BOND
3. European Union risks becoming the ‘museum of the world’, says Meta privacy chief
4. The Trump administration has shut down more than 100 climate studies
5. Breakneck data center growth challenges Microsoft’s sustainability goals
6. Reddit sues Anthropic for allegedly not paying for training data
7. OpenAI hits 3M business users and launches workplace tools to take on Microsoft
8. How much do language models memorize?
1. How AI is introducing errors into courtrooms
This MIT Technology Review article presents several instances of language model hallucination in documents filed in court. In one example in California, the law firm Ellis George submitted arguments citing articles that did not exist. The firm admitted to writing the arguments using Google Gemini and a specialized tool called CoCounsel, whose website claims that its AI is “backed by authoritative content”. The judge fined Ellis George 31’000 USD. Another example comes from a copyright lawsuit brought against Anthropic by record labels: a filing by Anthropic included a legal citation created by Claude with an incorrect title and an incorrect author name. Though the errors were discovered in both of these cases, the risk is that at some point a judge will make a decision based on evidence that has been partly hallucinated by AI. The cases also highlight the pressure that law firms are under to use tools that help produce documents to short deadlines. Verifying legal documents is typically a task entrusted to junior employees, and experts believe that the problem of hallucination in court filings will grow as law firms seek to reduce staff costs.
2. Trends – Artificial Intelligence – BOND
BOND, a global technology investment firm, has published a data-rich 340-slide presentation on the current and predicted status of AI. Three interesting points raised are:
- The cost of model training is increasing. The computing power used for training, measured in FLOPs (floating-point operations), has been increasing steadily at a rate of 360% per year, and training dataset size has been growing at a rate of 250% per year for the last 15 years. These increases are only very slightly offset by progress in compute efficiency and algorithms. The financial cost of model training has increased 2’400 times over the last eight years: a large model cost around 100 million USD to train in 2024, but Anthropic co-founder Dario Amodei believes the cost could rise to 100 billion USD by 2027 (see the back-of-the-envelope sketch after this list). That said, the cost of inference per token is falling.
- ChatGPT is one of the most successful Internet services ever, with 800 million weekly active users. It took ChatGPT under 3 months to reach 100 million users – faster than TikTok and Fortnite; in comparison, Netflix took over 10 years. Whereas the top uses of ChatGPT today include writing, therapy, role-playing people you need and automating repetitive work, the report foresees that the most common uses for the service in 10 years’ time will include conducting scientific research, modeling full biological systems, operating autonomous companies, designing advanced technologies and simulating human-like minds.
- In Big Tech’s search for monetization, the rise of China and the increased use of open-source models are risks. Capital expenditure by Big Tech on model development has been increasing at around 21% per year, driven by the need for data centers and faster infrastructure. The avenues for monetization include chips (e.g., Nvidia’s quarterly year-over-year revenue was up 78% at the end of 2024, and Google’s TPU chip sales reached 8.9 billion USD), Cloud Computing (e.g., CoreWeave’s revenue was up 730% in 2024), AI Infrastructure (e.g., Oracle’s revenue rose to 948 million USD over two years), Data Labeling & Evaluation, Data Storage & Management, the development of foundational AI models (e.g., OpenAI’s revenue rose to 3.7 billion USD annually), and API & Generative Search (e.g., Anthropic’s annualized revenue rose to 2 billion USD within eighteen months).
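To put the training-cost figures in perspective, here is a minimal back-of-the-envelope sketch in Python. It assumes the cited rates are compound annual figures – our reading, not a statement from the BOND report:

```python
# Back-of-the-envelope arithmetic for the growth rates cited above.
# Assumption (not from the BOND report itself): growth is compound.

def annual_multiplier(total_growth: float, years: float) -> float:
    """Constant year-over-year multiplier implied by a total growth factor."""
    return total_growth ** (1 / years)

# Training cost up 2'400x over eight years -> roughly 2.6x per year.
print(f"historical cost multiplier per year: {annual_multiplier(2_400, 8):.2f}x")

# Amodei's scenario: 100 million USD (2024) to 100 billion USD (2027)
# is a 1'000x increase in three years, i.e. 10x per year.
print(f"multiplier implied by the 2027 forecast: {annual_multiplier(1_000, 3):.1f}x")
```

The 10x-per-year multiplier implied by the 2027 forecast is well above the roughly 2.6x historical rate, which underlines how aggressive that projection is.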
3. European Union risks becoming the ‘museum of the world’, says Meta privacy chief
This Irish Times article features an interview with Erin Egan, the chief privacy officer at Meta. She is largely critical of European legislation, claiming that it reflects an effort by “a vocal minority of activist groups to delay AI innovation” that “ultimately harms consumers and businesses who can benefit from these technologies”. Meta claims that its personalized advertising is worth 200 billion EUR to EU businesses and affects 1.5 million jobs. At the same time, Meta has been fined several times for breaches of the GDPR privacy regulations – with one fine in May 2023 costing the company 1.2 billion EUR. The EU has recently fined Meta 200 million EUR under the Digital Markets Act, accusing the company of “using misleading practices and unclear terms to steer users” to the option of having their personal data processed, degrading the service for those who do not. Above all, Meta is worried about the number of digital regulations enacted in the EU, which it says makes operating there difficult. This view is echoed in a recently published report by former Italian prime minister Mario Draghi, which points to “inconsistent and restrictive regulations” as hindering technology companies, European ones included. Meta warned in 2022 that it would cease operating in the EU if no new framework were adopted for data transfers between Europe and the US.
4. The Trump administration has shut down more than 100 climate studies
As part of its campaign to reduce federal funding, the Trump administration has canceled more than 100 research projects relating to climate change. A search of the National Science Foundation (NSF) database for “climate change”, “clean energy”, “climate adaptation”, “environmental justice” and “climate justice” turns up 118 canceled projects that had received over 100 million USD in funding. Overall, the amount of funding cut could be 10 times that figure. For Michael Kratsios, head of the White House Office of Science and Technology Policy, “political biases have displaced the vital search for truth”. The administration has put forward a new NSF budget proposal that entails a 60% cut in overall research spending, with the near elimination of clean energy technology research. The US Global Change Research Program is being cut by 97%, and there is an 80% cut to the Ocean Observatories Initiative. For many researchers, these cuts are ideologically motivated and reflect a desire by the administration to undermine the power of the universities. Harvard University has launched a lawsuit against the administration, which, in retaliation, is now threatening to cut all of the university’s NSF funding.
5. Breakneck data center growth challenges Microsoft’s sustainability goals
This TechCrunch article looks at Microsoft’s struggle to meet its sustainability objectives. The company’s carbon emissions are up 23.4% since 2020, largely due to the increased need for data centers. The US electricity grid still relies heavily on fossil fuels, and Microsoft’s “electricity consumption has grown faster than the grids where we operate have decarbonized”. However, 97% of the company’s 2024 carbon footprint consisted of Scope 3 emissions – emissions considered to be outside a company’s direct control, such as those linked to transportation, raw materials in chips and hardware, building materials (the preparation of steel and concrete releases large amounts of carbon dioxide), and purchased services. For instance, semiconductor lithography uses chemicals like hexafluoroethane, a potent greenhouse gas: one ton has the warming effect of 9’200 tons of carbon dioxide. At the same time, Microsoft has invested in solar power, and its zero-carbon electricity capacity now stands at 34 gigawatts. The company has also invested in startups working to decarbonize steel and cement production.
6. Reddit sues Anthropic for allegedly not paying for training data
Reddit is suing Anthropic for allegedly having used data from Reddit discussion groups to train AI models without permission or compensation. Reddit asked Anthropic to refrain from scraping its website in 2024 and, despite assurances from Anthropic, claims that the AI company has scraped the site more than 100’000 times since then. A Reddit spokesman is quoted as saying: “we will not tolerate profit-seeking entities like Anthropic commercially exploiting Reddit content for billions of dollars without any return for redditors or respect for their privacy”. Reddit is asking for damages equal to the amount by which Anthropic has been enriched by the scraped content. OpenAI and Google have already signed deals with Reddit that allow their AI models to be trained on Reddit content. It is noteworthy that OpenAI CEO Sam Altman owns an 8.7% stake in Reddit and formerly sat on its board of directors.
7. OpenAI hits 3M business users and launches workplace tools to take on Microsoft
OpenAI has increased its number of paying enterprise customers from 2 million to 3 million in just 4 months. The increase reflects the company’s attractiveness as an “AI-native” company, rather than one that integrates AI into existing products, and could also indicate a greater readiness among corporate users to integrate AI into production environments. Whereas the company was urging caution in the use of AI only a year ago, CEO Sam Altman has pivoted on this position and is urging companies to adopt AI immediately. The company’s enterprise offering includes “connectors” that give ChatGPT direct access to data in Dropbox, SharePoint, OneDrive, and Google Drive. OpenAI’s Deep Research feature permits multi-step research tasks by gathering and synthesizing data from Microsoft and Google tools. The Record Mode tool allows Microsoft Teams users to transcribe and summarize meetings, and there is a new version of the Codex software engineering agent, based on the company’s new o3 reasoning model.
OpenAI nonetheless faces significant challenges. It has lost top talent to Anthropic: the article cites a report that found OpenAI employees are 8 times more likely to leave for Anthropic than the other way around. The company’s relationship with Microsoft continues, though Bing users now have access to OpenAI’s Sora video generation tool without having to pay the 20 USD monthly subscription for ChatGPT.
8. How much do language models memorize?
One of the big legal questions around today’s large language models is whether they were trained on a particular dataset – a question pertinent to several current IP lawsuits, for example. A model’s knowledge can come from memorization (data points the model has seen during training) or from generalization (data inferred from what was memorized). This research from Meta, Google, Cornell and Nvidia proposes an information-theoretic definition of memorization, measured as the number of bits of information stored per model parameter. The authors find that GPT-style transformers can memorize between 3.5 and 4 bits of information per parameter, measured by training models on random bit strings – data with no structure, which effectively disables generalization. This per-parameter capacity remains stable even as models are trained on ever larger volumes of data. They cite a state-of-the-art 8-billion-parameter model (32 GB on disk) having been trained on 15 trillion tokens (around 7 TB on disk). The research shows that models memorize up to their capacity, at which point generalization takes over as the model learns reusable patterns from the data. Bigger models enable more memorization, but bigger training datasets make it harder to determine whether a given data point was memorized or generalized. The authors postulate that modern language models are trained on too much data for it to be possible to test whether a specific data point is part of the training data.
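To see why, here is a minimal sketch of the capacity arithmetic using the figures cited above. The 3.6 bits-per-parameter midpoint and the use of the on-disk dataset size as a proxy for its information content are our assumptions for illustration, not values taken from the paper:

```python
# Capacity arithmetic behind the paper's argument, under two assumptions:
# a 3.6 bits/parameter midpoint (the paper estimates 3.5-4) and the on-disk
# dataset size as a rough proxy for its information content.

params = 8e9                 # 8-billion-parameter model (~32 GB at 4 bytes/param)
bits_per_param = 3.6         # assumed midpoint of the 3.5-4 bits/param estimate
capacity_bits = params * bits_per_param

dataset_bytes = 7e12         # ~7 TB of training data (15 trillion tokens)
dataset_bits = dataset_bytes * 8

print(f"memorization capacity: ~{capacity_bits / 8 / 1e9:.1f} GB")       # ~3.6 GB
print(f"dataset is ~{dataset_bits / capacity_bits:,.0f}x larger than capacity")

# With capacity for only a fraction of a percent of the data, the chance
# that any specific training example was memorized verbatim is small,
# which is why membership tests lose their statistical signal.
```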