ChatGPT is no CPA: Popular chatbot can’t pass accounting test

PROVO, Utah — ChatGPT is many things, but it’s no accountant. Researchers say the artificial intelligence chatbot has trouble understanding math and makes up facts to cover up its mistakes. The new findings identify areas of improvement for the revolutionary AI program. However, they might also give a sense of relief to people who fear that robots will one day take over their jobs.

The latest AI chatbot has been generating a lot of buzz recently. Since its release, it has become the fastest-growing technology platform, reaching 100 million users in under two months. It has passed the bar exam with a score in the 90th percentile, passed 13 of 15 AP exams, and earned a nearly perfect score on the verbal portion of the GRE. ChatGPT (short for Chat Generative Pre-trained Transformer) uses machine learning to generate natural language text. Despite these accomplishments, there has also been a wave of concerns.

“When this technology first came out, everyone was worried that students could now use it to cheat,” says lead study author David Wood, a Brigham Young University professor of accounting. “But opportunities to cheat have always existed. So for us, we’re trying to focus on what we can do with this technology now that we couldn’t do before to improve the teaching process for faculty and the learning process for students. Testing it out was eye-opening.”

The authors focused their attention on accounting exams and how ChatGPT fares against actual accounting students. The findings may help in understanding how to incorporate ChatGPT into education.

Researchers from 186 educational institutions in 14 countries participated in the study, contributing 25,181 classroom accounting exam questions. The team also recruited undergraduate students at BYU to feed another 2,268 textbook test bank questions to ChatGPT. The questions covered accounting information systems, auditing, financial accounting, managerial accounting, and tax, and they varied in format and difficulty, including true/false, multiple-choice, and written-response items.

Students scored higher than ChatGPT. While the AI earned an overall score of 47.4 percent, students averaged 76.7 percent. Performance varied by topic: ChatGPT outscored students on 11.3 percent of questions, especially those covering auditing and accounting information systems.

However, the bot was not as well-versed in tax, financial, and managerial assessments. One reason for this gap is that ChatGPT struggled with the math these questions require. The researchers noticed a pattern of nonsensical errors, such as adding two numbers in a subtraction problem or dividing numbers incorrectly.

Question type also played a role in how well ChatGPT could rack up points. It did better on true/false questions (68.7% correct) and multiple-choice questions (59.5%). On short-answer questions, however, it scored only between 28.7 and 39.1 percent. To compensate for gaps in its knowledge, ChatGPT often dressed up incorrect answers in authoritative-sounding prose or gave different answers to the same question.

“It’s not perfect; you’re not going to be using it for everything,” says Jessica Wood, a freshman at BYU who participated in the study, in a university release. “Trying to learn solely by using ChatGPT is a fool’s errand.”

Since ChatGPT is fairly new, there is a real possibility that future updates will improve its ability to answer accounting-related questions. The most promising aspect, according to the authors, is how the chatbot could reshape the way professors approach teaching and learning.

“It’s an opportunity to reflect on whether we are teaching value-added information or not,” says study co-author and BYU accounting professor Melissa Larson. “This is a disruption, and we need to assess where we go from here. Of course, I’m still going to have [teaching assistants], but this is going to force us to use them in different ways.”

The study is published in the journal Issues in Accounting Education.

How does ChatGPT work?

According to ChatGPT itself, the program is a language model based on the GPT-4 architecture developed by OpenAI. It is designed to understand and generate human-like responses in a conversational context. The underlying technology, GPT-4, is an advanced iteration of the GPT series and improves upon its predecessors in terms of scale and performance. Here’s an overview of how ChatGPT works:

  1. Pre-training: ChatGPT is pre-trained on a large body of text data from diverse sources like books, articles, and websites. During this phase, the model learns the structure and patterns in human language, such as grammar, syntax, semantics, and even some factual information. However, it is essential to note that the knowledge acquired during pre-training is limited to the information available in the training data, which has a cutoff date.
  2. Fine-tuning: After the pre-training phase, ChatGPT is fine-tuned using a narrower dataset, typically containing conversations or dialogue samples. This dataset may be generated with the help of human reviewers following specific guidelines. The fine-tuning process helps the model learn to generate more contextually relevant and coherent responses in a conversational setting.
  3. Transformer architecture: ChatGPT is based on the transformer architecture, which allows it to efficiently process and generate text. It uses self-attention mechanisms to weigh the importance of words in a given context and to capture long-range dependencies in language. This architecture enables the model to understand and generate complex and contextually appropriate responses (a minimal sketch of the attention computation appears after this list).
  4. Tokenization: When a user inputs text, ChatGPT first tokenizes the text into smaller units called tokens. These tokens can represent characters, words, or subwords, depending on the language and tokenization strategy used. The model processes these tokens in parallel, allowing it to generate context-aware responses quickly.
  5. Decoding: After processing the input tokens and building a contextual representation, ChatGPT decodes the output by generating a sequence of tokens that form the response. This is typically done using greedy search, beam search, or another decoding strategy to select the most likely next token based on the model’s predictions (see the decoding sketch after this list).
  6. Interactive conversation: ChatGPT maintains a conversation history to keep track of the context during a dialogue. This history is fed back into the model during each interaction, enabling it to generate contextually coherent responses.
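
To make the self-attention mechanism described in item 3 more concrete, here is a minimal Python/NumPy sketch of scaled dot-product attention, the core operation inside each transformer layer. The tiny random inputs and the single attention head are illustrative assumptions for this example; a real model uses learned projection weights, many attention heads, and many stacked layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity scores between tokens
    weights = softmax(scores, axis=-1)  # each row sums to 1: how strongly a token attends to the others
    return weights @ V                  # each output is a weighted mix of the value vectors

# Toy example: 4 tokens, each represented by an 8-dimensional vector (random numbers, illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# In a real transformer, Q, K, and V come from separate learned linear projections of x.
output = scaled_dot_product_attention(x, x, x)
print(output.shape)  # (4, 8)
```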
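
Items 4 and 5 describe tokenization and decoding. The sketch below, again purely illustrative, walks through a greedy decoding loop: a stand-in function plays the role of the trained network by scoring every entry in a made-up vocabulary, the highest-scoring token is appended to the sequence, and the loop repeats until an end-of-text token appears or a length limit is reached. The vocabulary and the fake_next_token_logits function are invented for this example and do not reflect ChatGPT’s actual tokenizer or model.

```python
import numpy as np

# A made-up vocabulary of subword tokens; real tokenizers (e.g., byte-pair encoding)
# use vocabularies with tens of thousands of entries.
vocab = ["<end>", "debit", "credit", "the", "account", "balance"]

def fake_next_token_logits(token_ids):
    # Stand-in for the transformer: returns one score per vocabulary entry.
    # Here the scores are arbitrary values seeded by the context length.
    rng = np.random.default_rng(len(token_ids))
    return rng.normal(size=len(vocab))

def greedy_decode(prompt_ids, max_new_tokens=10):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = fake_next_token_logits(ids)
        next_id = int(np.argmax(logits))   # greedy: always take the single most likely token
        ids.append(next_id)
        if vocab[next_id] == "<end>":      # stop once the end-of-text token is produced
            break
    return ids

prompt = [vocab.index("the"), vocab.index("account")]
generated = greedy_decode(prompt)
print([vocab[i] for i in generated])
```

Beam search, mentioned in item 5, keeps several candidate sequences alive at each step instead of committing to a single best token, while sampling strategies pick the next token at random in proportion to its predicted probability.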

It’s worth noting that the AI program itself acknowledges its limitations, such as generating incorrect or nonsensical answers, being sensitive to input phrasing, being excessively verbose, or failing to ask clarifying questions for ambiguous queries. OpenAI adds that it continually works on improving these aspects and refining the model to make it more effective and safer for the public to use.

About the Author

Jocelyn Solis-Moreira

Jocelyn is a New York-based science journalist whose work has appeared in Discover Magazine, Health, and Live Science, among other publications. She holds a Master of Science in Psychology with a concentration in behavioral neuroscience and a Bachelor of Science in integrative neuroscience from Binghamton University. Jocelyn has reported on several medical and science topics ranging from coronavirus news to the latest findings in women’s health.
