如何使用大模型给消费记录进行分类

前言

记账有助于人对自己的财务进行管理，分析出其中不合理的支出，减少不必要的消费。但是记账往往较为繁琐，让人难以坚持。尽管现在人们基本只靠电子支付，可以方便一次性导出自己的消费账单，但是这些账单往往存在于不同平台，同时这些记录无法互通，只能借用第三方的软件进行整合，但是第三方软件无法自动对这些消费记录进行分类，这不利于个人对本月消费情况进行分析，同时每个人都有特色化的分类需求，哪怕有些能够自动分类的软件也无法满足。

本文将通过导出支付宝、微信、拼多多等软件的支付账单，使用Python对数据进行清洗与重构，调用通义大模型的API服务对消费记录进行标签分类，最后导入到钱迹APP中进行保存。

导出账单

微信账单的获取方式

支付宝账单的获取方式

拼多多账单的获取方式

清洗数据

导入所需要的库

import pandas as pd
from tqdm import tqdm
from http import HTTPStatus
from dashscope import Application
import json
import chardet

将微信、支付宝和拼多多账单转换为统一的格式，提取出需要的信息，

def detect_encoding(file_path: str):
    """
    Detects the encoding of a file using chardet.

    Args:
        file_path (str): The path to the file.

    Returns:
        str: The detected encoding.
    """
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
    return result['encoding']

def read_file(path: str, skip_rows: tuple, method: str) -> pd.DataFrame:
    """
    Reads the file and returns a DataFrame based on the specified method.

    Args:
        path (str): The path to the file.
        skip_rows (tuple): A tuple containing the start and end row indices to skip.
        method (str): The method used to determine how to read the file ('weixing', 'zhifubao', or 'pdd').

    Returns:
        pd.DataFrame: The DataFrame containing the file data.
    """
    encoding = detect_encoding(path)
    if method in ('weixing', 'zhifubao'):
        with open(path, 'r', encoding=encoding) as file:
            lines = file.readlines()[skip_rows[0]:skip_rows[1]]
            data = [line.strip().split(',') for line in lines]
            return pd.DataFrame(data)
    elif method == 'pdd':
        return pd.read_csv(path, encoding=encoding)


def process_bill(data: pd.DataFrame, columns: list, method: str) -> pd.DataFrame:
    """
    Processes the DataFrame to extract and format the bill information.

    Args:
        data (pd.DataFrame): The input DataFrame containing the raw data.
        columns (list): The column indices to select from the DataFrame.
        method (str): The method used to determine how to process the data ('weixing', 'zhifubao', or 'pdd').

    Returns:
        pd.DataFrame: The processed DataFrame with standardized bill information.
    """
    bill = data.iloc[:, columns]
    if method == 'pdd':
        bill = bill.loc[bill.iloc[:, 3].isin(['待收货', '待评价', '交易成功'])]
        bill.iloc[:, 3] = '支出'
    bill.columns = ['时间', '金额', '商品名称', '类型', '商户名称']
    bill.update(bill.iloc[:, 1:].replace(' ', '', regex=True))
    bill = bill.loc[bill['类型'].isin(['支出', '收入'])]
    bill = bill.loc[~bill['商户名称'].isin(['拼多多平台商户'])]
    bill['分类'] = None
    bill['备注'] = ''
    bill['金额'] = bill['金额'].str.replace('¥', '')
    bill['时间'] = pd.to_datetime(bill['时间']).dt.strftime('%Y/%m/%d %H:%M:%S')
    return bill.reset_index(drop=True)


def get_bill(path: str, method: str = 'weixing') -> pd.DataFrame:
    """
    Reads and processes the bill data from a file based on the specified method.

    Args:
        path (str): The path to the file.
        method (str): The method used to determine how to read and process the file ('weixing', 'zhifubao', or 'pdd').

    Returns:
        pd.DataFrame: The processed DataFrame containing the bill information.

    Raises:
        ValueError: If the provided method is not supported.
    """

    if method == 'weixing':
        data = read_file(path, (17, -1), method)
        bill = process_bill(data, [0, 5, 3, 4, 2], method)
    elif method == 'zhifubao':
        data = read_file(path, (5, -8), method)
        bill = process_bill(data, [2, 9, 8, 10, 7], method)
    elif method == 'pdd':
        data = read_file(path, (0, 0), method)
        bill = process_bill(data, [6, 1, 0, 9, 4], method)
    else:
        raise ValueError(f"Unsupported method: {method}")
    return bill

建立大模型

创建并登录阿里云账号，进入阿里云百炼的控制台，点击我的应用之后点击新增应用。

之后点击直接创建。在选择模型中设置与本文类似的参数

之后填入Prompt来决定你的模型输出，可以参考本文给出的Prompt，如果想要输出其它类别，可以修改Prompt中限制里面列表的名称，不要忘记修改数字。

# 角色
你是一位专业的商品分类专家，负责根据商品名称为其分配合适的标签。

## 技能
### 技能 1: 商品分类
- 根据用户提供的商品名称，准确地将其归类到相应的标签下。
- 如果遇到不确定的商品，可以使用搜索工具来获取更多信息以帮助分类。
- 确保分类结果的准确性，并能够处理各种不同类型的商品名称。

## 限制
- 只提供商品分类服务，不涉及其他类型的咨询或建议。
- 分类标签仅限于以下列表中的13个类别：['餐食', '衣服', '交通', '话费', '日用品', '住房', '美妆', '医疗', '娱乐', '数码', '运动', '水电']。
- 所有回复必须简洁明了，只需给出分类结果。
- 如果商品名称为美团的交易记录， 则自动标记为餐食

## 返回形式
- 以JSON数据格式返回分类结果。例如，如果确认商品类别为“衣服”，则返回的数据应为 {"Label": "衣服"}。

最后点击发布。

调用模型的代码

def get_output(prompt: str) -> dict:
    """
    Sends a request to the application's API and returns the response output.

    Args:
        prompt (str): The input prompt for the API request.

    Returns:
        dict: The output from the API response, or None if the request fails.
    """
    # Define the application ID and API key
    app_id = 'ab778235f79144b29b24cb6ed1dd1f8d'
    api_key = 'sk-b0e213e453e74d958fb6f2e3e5a50ae8'

    # Send the POST request to the API
    response = Application.call(
        app_id=app_id,
        prompt=prompt,
        api_key=api_key,
    )

    # Check if the response status code is 200 (OK)
    if response.status_code != HTTPStatus.OK:
        # Print error details if the request fails
        print('request_id=%s, code=%s, message=%s\n' % (
            response.request_id, response.status_code, response.message
        ))
        return None
    else:
        # Return the output from the API response
        return response.output

你的app_id在模型发布后既可以看到

你的api_key可以在点击调用，再次点击查看我的API KEY中查看。

创建数据库

消费数据每月都在不断产生，需要每个月按时导入，同时新导入的数据中可能包含一些过去已经导入的数据，所以创建一个本地的数据库，对每次新导入的数据中未保存的数据提取出来，进行大模型的分类。

将新数据与老数据进行合并同时提取出未进行分类的数据。

def get_data(bill: pd.DataFrame) -> (pd.DataFrame, pd.Index):
    """
    Reads existing data from a CSV file, concatenates it with new bill data, and processes the combined data.

    Args:
        bill (pd.DataFrame): The DataFrame containing the new bill data.

    Returns:
        pd.DataFrame: The processed DataFrame with combined data.
        pd.Index: The index of rows where the '分类' column is null.
    """
    try:
        # Read the existing data from the CSV file
        data = pd.read_csv('data.csv')

        # Convert and format the '时间' column
        data['时间'] = pd.to_datetime(data['时间']).dt.strftime('%Y/%m/%d %H:%M:%S')

        # Concatenate the existing data with the new bill data
        data_concat = pd.concat([data, bill], ignore_index=True)

        # Remove duplicate entries based on '时间' and '金额' columns, keeping the first occurrence
        data_concat = data_concat.drop_duplicates(subset=['时间'], keep='first')

        # Sort the combined data by '时间' in descending order
        data_concat = data_concat.sort_values(by='时间', ascending=False).reset_index(drop=True)

    except FileNotFoundError:
        # Handle the case where the CSV file does not exist
        print('文件不存在，第一次创建')
        data_concat = bill

    except pd.errors.EmptyDataError:
        # Handle the case where the CSV file is empty
        print('文件为空，第一次创建')
        data_concat = bill

    except Exception as e:
        # Handle any other unexpected exceptions
        print(f'发生错误: {e}')
        data_concat = bill

    finally:
        # Find the index of rows where the '分类' column is null
        index = data_concat[data_concat['分类'].isnull()].index

    return data_concat, index

对数据进行分类

对提取出来的数据进行标签分类。

def get_category(bill: pd.DataFrame, index: pd.Index) -> (pd.DataFrame, list):
    """
    Categorizes the bill entries based on the provided index and updates the '分类' column.

    Args:
        bill (pd.DataFrame): The DataFrame containing the bill data.
        index (pd.Index): The index of rows to be categorized.

    Returns:
        pd.DataFrame: The updated DataFrame with categorized entries.
        list: A list of indices where categorization failed.
    """
    # Define the list of valid categories
    categories = ['其它', '餐食', '衣服', '交通', '话费', '日用品', '住房', '美妆', '医疗', '娱乐', '数码', '运动', '水电', '未知']

    # Initialize a list to store indices where categorization failed
    errors = []

    # Iterate over the specified index
    for i in tqdm(index, desc="Categorizing entries"):
        # Get the remark for the current row
        prompt = bill.loc[i, '备注']

        # Get the output from the API
        output = get_output(prompt)

        try:
            # Parse the JSON output to get the category label
            category = json.loads(output['text'])['Label']

            # Check if the category is valid
            if category in categories:
                bill.loc[i, '分类'] = category
            else:
                errors.append(i)
                print(f"Invalid category '{category}' for index {i}")

        except (json.JSONDecodeError, KeyError, TypeError):
            # Handle JSON decoding errors, missing keys, and type errors
            errors.append(i)
            print(f"Failed to categorize entry at index {i}: {output['text']}")

    return bill, errors

最后运行数据

def format_remarks(row: pd.Series) -> str:
    """
    Formats the remarks for a given row by combining the product name and company name.

    Args:
        row (pd.Series): A row from the DataFrame containing '商品名称' and '商户名称' columns.

    Returns:
        str: A formatted string with the product name and company name, or '未知' if the values are missing.
    """
    # Get the product name, default to '未知' if it is NaN
    product_name = row['商品名称'] if pd.notna(row['商品名称']) else '未知'

    # Get the company name, default to '未知' if it is NaN
    company_name = row['商户名称'] if pd.notna(row['商户名称']) else '未知'

    # Format the remarks string
    return f'商品名称：{product_name}\n商户名称：{company_name}'


if __name__ == '__main__':
    path = 'bill/alipay_record_20241104_1244_1.csv'
    bill = get_bill(path, method='zhifubao')
    bill['备注'] = bill.apply(format_remarks, axis=1)
    del bill['商品名称']
    del bill['商户名称']
    data, index = get_data(bill)
    data, error = get_category(data, index)
    data.to_csv('data.csv', index=False)

将数据导入钱迹APP

导入数据的教程

结果展示

所以标签都依靠大模型标记，一次性整理几年的消费记录。