前言
记账有助于人对自己的财务进行管理,分析出其中不合理的支出,减少不必要的消费。但是记账往往较为繁琐,让人难以坚持。尽管现在人们基本只靠电子支付,可以方便一次性导出自己的消费账单,但是这些账单往往存在于不同平台,同时这些记录无法互通,只能借用第三方的软件进行整合,但是第三方软件无法自动对这些消费记录进行分类,这不利于个人对本月消费情况进行分析,同时每个人都有特色化的分类需求,哪怕有些能够自动分类的软件也无法满足。
本文将通过导出支付宝、微信、拼多多等软件的支付账单,使用Python对数据进行清洗与重构,调用通义大模型的API服务对消费记录进行标签分类,最后导入到钱迹APP中进行保存。
导出账单
清洗数据
导入所需要的库
import pandas as pd
from tqdm import tqdm
from http import HTTPStatus
from dashscope import Application
import json
import chardet
将微信、支付宝和拼多多账单转换为统一的格式,提取出需要的信息,
def detect_encoding(file_path: str):
"""
Detects the encoding of a file using chardet.
Args:
file_path (str): The path to the file.
Returns:
str: The detected encoding.
"""
with open(file_path, 'rb') as f:
result = chardet.detect(f.read())
return result['encoding']
def read_file(path: str, skip_rows: tuple, method: str) -> pd.DataFrame:
"""
Reads the file and returns a DataFrame based on the specified method.
Args:
path (str): The path to the file.
skip_rows (tuple): A tuple containing the start and end row indices to skip.
method (str): The method used to determine how to read the file ('weixing', 'zhifubao', or 'pdd').
Returns:
pd.DataFrame: The DataFrame containing the file data.
"""
encoding = detect_encoding(path)
if method in ('weixing', 'zhifubao'):
with open(path, 'r', encoding=encoding) as file:
lines = file.readlines()[skip_rows[0]:skip_rows[1]]
data = [line.strip().split(',') for line in lines]
return pd.DataFrame(data)
elif method == 'pdd':
return pd.read_csv(path, encoding=encoding)
def process_bill(data: pd.DataFrame, columns: list, method: str) -> pd.DataFrame:
"""
Processes the DataFrame to extract and format the bill information.
Args:
data (pd.DataFrame): The input DataFrame containing the raw data.
columns (list): The column indices to select from the DataFrame.
method (str): The method used to determine how to process the data ('weixing', 'zhifubao', or 'pdd').
Returns:
pd.DataFrame: The processed DataFrame with standardized bill information.
"""
bill = data.iloc[:, columns]
if method == 'pdd':
bill = bill.loc[bill.iloc[:, 3].isin(['待收货', '待评价', '交易成功'])]
bill.iloc[:, 3] = '支出'
bill.columns = ['时间', '金额', '商品名称', '类型', '商户名称']
bill.update(bill.iloc[:, 1:].replace(' ', '', regex=True))
bill = bill.loc[bill['类型'].isin(['支出', '收入'])]
bill = bill.loc[~bill['商户名称'].isin(['拼多多平台商户'])]
bill['分类'] = None
bill['备注'] = ''
bill['金额'] = bill['金额'].str.replace('¥', '')
bill['时间'] = pd.to_datetime(bill['时间']).dt.strftime('%Y/%m/%d %H:%M:%S')
return bill.reset_index(drop=True)
def get_bill(path: str, method: str = 'weixing') -> pd.DataFrame:
"""
Reads and processes the bill data from a file based on the specified method.
Args:
path (str): The path to the file.
method (str): The method used to determine how to read and process the file ('weixing', 'zhifubao', or 'pdd').
Returns:
pd.DataFrame: The processed DataFrame containing the bill information.
Raises:
ValueError: If the provided method is not supported.
"""
if method == 'weixing':
data = read_file(path, (17, -1), method)
bill = process_bill(data, [0, 5, 3, 4, 2], method)
elif method == 'zhifubao':
data = read_file(path, (5, -8), method)
bill = process_bill(data, [2, 9, 8, 10, 7], method)
elif method == 'pdd':
data = read_file(path, (0, 0), method)
bill = process_bill(data, [6, 1, 0, 9, 4], method)
else:
raise ValueError(f"Unsupported method: {method}")
return bill
建立大模型
创建并登录阿里云账号,进入阿里云百炼的控制台,点击我的应用之后点击新增应用。
之后点击直接创建。在选择模型中设置与本文类似的参数
之后填入Prompt来决定你的模型输出,可以参考本文给出的Prompt,如果想要输出其它类别,可以修改Prompt中限制里面列表的名称,不要忘记修改数字。
# 角色
你是一位专业的商品分类专家,负责根据商品名称为其分配合适的标签。
## 技能
### 技能 1: 商品分类
- 根据用户提供的商品名称,准确地将其归类到相应的标签下。
- 如果遇到不确定的商品,可以使用搜索工具来获取更多信息以帮助分类。
- 确保分类结果的准确性,并能够处理各种不同类型的商品名称。
## 限制
- 只提供商品分类服务,不涉及其他类型的咨询或建议。
- 分类标签仅限于以下列表中的13个类别:['餐食', '衣服', '交通', '话费', '日用品', '住房', '美妆', '医疗', '娱乐', '数码', '运动', '水电']。
- 所有回复必须简洁明了,只需给出分类结果。
- 如果商品名称为美团的交易记录, 则自动标记为餐食
## 返回形式
- 以JSON数据格式返回分类结果。例如,如果确认商品类别为“衣服”,则返回的数据应为 {"Label": "衣服"}。
最后点击发布。
调用模型的代码
def get_output(prompt: str) -> dict:
"""
Sends a request to the application's API and returns the response output.
Args:
prompt (str): The input prompt for the API request.
Returns:
dict: The output from the API response, or None if the request fails.
"""
# Define the application ID and API key
app_id = 'ab778235f79144b29b24cb6ed1dd1f8d'
api_key = 'sk-b0e213e453e74d958fb6f2e3e5a50ae8'
# Send the POST request to the API
response = Application.call(
app_id=app_id,
prompt=prompt,
api_key=api_key,
)
# Check if the response status code is 200 (OK)
if response.status_code != HTTPStatus.OK:
# Print error details if the request fails
print('request_id=%s, code=%s, message=%s\n' % (
response.request_id, response.status_code, response.message
))
return None
else:
# Return the output from the API response
return response.output
你的app_id在模型发布后既可以看到
你的api_key可以在点击调用,再次点击查看我的API KEY中查看。
创建数据库
消费数据每月都在不断产生,需要每个月按时导入,同时新导入的数据中可能包含一些过去已经导入的数据,所以创建一个本地的数据库,对每次新导入的数据中未保存的数据提取出来,进行大模型的分类。
将新数据与老数据进行合并同时提取出未进行分类的数据。
def get_data(bill: pd.DataFrame) -> (pd.DataFrame, pd.Index):
"""
Reads existing data from a CSV file, concatenates it with new bill data, and processes the combined data.
Args:
bill (pd.DataFrame): The DataFrame containing the new bill data.
Returns:
pd.DataFrame: The processed DataFrame with combined data.
pd.Index: The index of rows where the '分类' column is null.
"""
try:
# Read the existing data from the CSV file
data = pd.read_csv('data.csv')
# Convert and format the '时间' column
data['时间'] = pd.to_datetime(data['时间']).dt.strftime('%Y/%m/%d %H:%M:%S')
# Concatenate the existing data with the new bill data
data_concat = pd.concat([data, bill], ignore_index=True)
# Remove duplicate entries based on '时间' and '金额' columns, keeping the first occurrence
data_concat = data_concat.drop_duplicates(subset=['时间'], keep='first')
# Sort the combined data by '时间' in descending order
data_concat = data_concat.sort_values(by='时间', ascending=False).reset_index(drop=True)
except FileNotFoundError:
# Handle the case where the CSV file does not exist
print('文件不存在,第一次创建')
data_concat = bill
except pd.errors.EmptyDataError:
# Handle the case where the CSV file is empty
print('文件为空,第一次创建')
data_concat = bill
except Exception as e:
# Handle any other unexpected exceptions
print(f'发生错误: {e}')
data_concat = bill
finally:
# Find the index of rows where the '分类' column is null
index = data_concat[data_concat['分类'].isnull()].index
return data_concat, index
对数据进行分类
对提取出来的数据进行标签分类。
def get_category(bill: pd.DataFrame, index: pd.Index) -> (pd.DataFrame, list):
"""
Categorizes the bill entries based on the provided index and updates the '分类' column.
Args:
bill (pd.DataFrame): The DataFrame containing the bill data.
index (pd.Index): The index of rows to be categorized.
Returns:
pd.DataFrame: The updated DataFrame with categorized entries.
list: A list of indices where categorization failed.
"""
# Define the list of valid categories
categories = ['其它', '餐食', '衣服', '交通', '话费', '日用品', '住房', '美妆', '医疗', '娱乐', '数码', '运动', '水电', '未知']
# Initialize a list to store indices where categorization failed
errors = []
# Iterate over the specified index
for i in tqdm(index, desc="Categorizing entries"):
# Get the remark for the current row
prompt = bill.loc[i, '备注']
# Get the output from the API
output = get_output(prompt)
try:
# Parse the JSON output to get the category label
category = json.loads(output['text'])['Label']
# Check if the category is valid
if category in categories:
bill.loc[i, '分类'] = category
else:
errors.append(i)
print(f"Invalid category '{category}' for index {i}")
except (json.JSONDecodeError, KeyError, TypeError):
# Handle JSON decoding errors, missing keys, and type errors
errors.append(i)
print(f"Failed to categorize entry at index {i}: {output['text']}")
return bill, errors
最后运行数据
def format_remarks(row: pd.Series) -> str:
"""
Formats the remarks for a given row by combining the product name and company name.
Args:
row (pd.Series): A row from the DataFrame containing '商品名称' and '商户名称' columns.
Returns:
str: A formatted string with the product name and company name, or '未知' if the values are missing.
"""
# Get the product name, default to '未知' if it is NaN
product_name = row['商品名称'] if pd.notna(row['商品名称']) else '未知'
# Get the company name, default to '未知' if it is NaN
company_name = row['商户名称'] if pd.notna(row['商户名称']) else '未知'
# Format the remarks string
return f'商品名称:{product_name}\n商户名称:{company_name}'
if __name__ == '__main__':
path = 'bill/alipay_record_20241104_1244_1.csv'
bill = get_bill(path, method='zhifubao')
bill['备注'] = bill.apply(format_remarks, axis=1)
del bill['商品名称']
del bill['商户名称']
data, index = get_data(bill)
data, error = get_category(data, index)
data.to_csv('data.csv', index=False)
将数据导入钱迹APP
结果展示
所以标签都依靠大模型标记,一次性整理几年的消费记录。