* Zhipu SDK update: adapt to the latest Zhipu SDK, support GLM-4V (#1502)
* Adapt Google Gemini: extract files from the user input
* Adapt to the latest Zhipu SDK, support glm-4v
* requirements.txt fix
* pending history check
---------
Co-authored-by: binary-husky <qingxu.fu@outlook.com>
* Update "生成多种Mermaid图表" (generate multiple Mermaid charts) plugin: Separate out the file reading function (#1520)
* Update crazy_functional.py with new functionality to deal with PDF
* Update crazy_functional.py and Mermaid.py for plugin_kwargs
* Update crazy_functional.py with new chart type: mind map
* Update SELECT_PROMPT and i_say_show_user messages
* Update ArgsReminder message in get_crazy_functions() function
* Update with read md file and update PROMPTS
* Restore the PROMPTS, as testing found that the initial version worked best
* Update Mermaid chart generation function
* version 3.71
* Resolve issue #1510
* Remove unnecessary text from sys_prompt in 解析历史输入 function
* Remove sys_prompt message in 解析历史输入 function
* Update bridge_all.py: supports gpt-4-turbo-preview (#1517)
* Update bridge_all.py: supports gpt-4-turbo-preview
* Update bridge_all.py
---------
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
* Update config.py: supports gpt-4-turbo-preview (#1516)
* Update config.py: supports gpt-4-turbo-preview
* Update config.py
---------
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
* Refactor 解析历史输入 function to handle file input
* Update Mermaid chart generation functionality
* rename files and functions
---------
Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: hongyi-zhao <hongyi.zhao@gmail.com>
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
* Integrate Mathpix OCR (#1468)
* Update Latex输出PDF结果.py: implement "PDF翻译中文并重新编译PDF" (translate a PDF into Chinese and recompile the PDF) with the help of Mathpix
* Update config.py: add mathpix appid & appkey
* Add 'PDF翻译中文并重新编译PDF' feature to plugins.
---------
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
* fix zhipuai
* check picture
* remove glm-4 due to bug
* Modify config
* Check MATHPIX_APPID
* Remove unnecessary code and update function_plugins dictionary
* capture non-standard token overflow
* bug fix #1524
* change mermaid style
* Support Mermaid scroll zoom in/out and reset, mouse scrolling and dragging (#1530)
* Support Mermaid scroll zoom in/out and reset, mouse scrolling and dragging
* Fine-tuning unsuccessful; stage for now
* update
---------
Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>
* ver 3.72
* change live2d
* save the status of `clear btn` in cookie
* Persist front-end selections
* js ui bug fix
* reset btn bug fix
* update live2d tips
* fix missing get_token_num method
* fix live2d toggle switch
* fix persistent custom btn with cookie
* fix zhipuai feedback with core functionality
* Refactor button update and clean up functions
* trailing space removal
* Fix missing MATHPIX_APPID and MATHPIX_APPKEY configuration
* Prompt fix, mind-map prompt optimization (#1537)
* Adapt Google Gemini: extract files from the user input
* Mind-map prompt optimization
* Fix missing MATHPIX_APPID and MATHPIX_APPKEY configuration
---------
Co-authored-by: binary-husky <qingxu.fu@outlook.com>
* Optimize the "PDF翻译中文并重新编译PDF" plugin (#1602)
* Add gemini_endpoint to API_URL_REDIRECT (#1560)
* Add gemini_endpoint to API_URL_REDIRECT
* Update gemini-pro and gemini-pro-vision model_info endpoints
* Update to support new claude models (#1606)
* Add anthropic library and update claude models
* Update bridge_claude.py: add support for image input; fix some bugs.
* Add the Claude_3_Models variable to limit the number of images
* Refactor code to improve readability and maintainability
* minor claude bug fix
* more flexible one-api support
* reformat config
* fix one-api new access bug
* dummy
* compat non-standard api
* version 3.73
---------
Co-authored-by: XIao <46100050+Kilig947@users.noreply.github.com>
Co-authored-by: Menghuan1918 <menghuan2003@outlook.com>
Co-authored-by: hongyi-zhao <hongyi.zhao@gmail.com>
Co-authored-by: Hao Ma <893017927@qq.com>
Co-authored-by: zeyuan huang <599012428@qq.com>
85 lines
3.8 KiB
Python
from crazy_functions.crazy_utils import read_and_clean_pdf_text, get_files_from_everything
import os
import re

def extract_text_from_files(txt, chatbot, history):
    """
    Find pdf/md/word files, extract their text content, and return a status flag together with the text.

    Args:
        txt (str): Content of the input area (the text in which files are looked up)
        chatbot: chatbot inputs and outputs (handle of the UI chat window, used to visualize the data flow)
        history (list): List of chat history

    Returns:
        bool: Whether any file was found
        final_result (list): Text content
        page_one (list): First-page content / summary
        file_manifest (list): File paths
        exception (str): Information the user has to handle manually; stays empty if nothing went wrong
    """

    final_result = []
    page_one = []
    file_manifest = []
    exception = ""

    if txt == "":
        final_result.append(txt)
        return False, final_result, page_one, file_manifest, exception  # If the input is not a file, return the input content directly

    # Look for files referenced by the input content
    file_pdf, pdf_manifest, folder_pdf = get_files_from_everything(txt, '.pdf')
    file_md, md_manifest, folder_md = get_files_from_everything(txt, '.md')
    file_word, word_manifest, folder_word = get_files_from_everything(txt, '.docx')
    file_doc, doc_manifest, folder_doc = get_files_from_everything(txt, '.doc')

    # Legacy .doc files are reported back to the caller instead of being parsed
    if file_doc:
        exception = "word"
        return False, final_result, page_one, file_manifest, exception

    file_num = len(pdf_manifest) + len(md_manifest) + len(word_manifest)
    if file_num == 0:
        final_result.append(txt)
        return False, final_result, page_one, file_manifest, exception  # If the input is not a file, return the input content directly

    if file_pdf:
        try:  # Try to import the dependency; if it is missing, report it so the caller can suggest installing it
            import fitz
        except ImportError:
            exception = "pdf"
            return False, final_result, page_one, file_manifest, exception
        for fp in pdf_manifest:
            file_content, pdf_one = read_and_clean_pdf_text(fp)  # (Try to) split the PDF by section
            file_content = file_content.encode('utf-8', 'ignore').decode()  # avoid reading non-utf8 chars
            pdf_one = str(pdf_one).encode('utf-8', 'ignore').decode()  # avoid reading non-utf8 chars
            final_result.append(file_content)
            page_one.append(pdf_one)
            file_manifest.append(os.path.relpath(fp, folder_pdf))

    if file_md:
        for fp in md_manifest:
            with open(fp, 'r', encoding='utf-8', errors='replace') as f:
                file_content = f.read()
            file_content = file_content.encode('utf-8', 'ignore').decode()
            # Extract the level-1 headings of the Markdown file as the summary
            # (level-2 headings are not matched by this pattern)
            headers = re.findall(r'^#\s(.*)$', file_content, re.MULTILINE)
            if len(headers) > 0:
                page_one.append("\n".join(headers))  # Join all headings, separated by newlines
            else:
                page_one.append("")
            final_result.append(file_content)
            file_manifest.append(os.path.relpath(fp, folder_md))

    if file_word:
        try:  # Try to import the dependency; if it is missing, report it so the caller can suggest installing it
            from docx import Document
        except ImportError:
            exception = "word_pip"
            return False, final_result, page_one, file_manifest, exception
        for fp in word_manifest:
            doc = Document(fp)
            file_content = '\n'.join([p.text for p in doc.paragraphs])
            file_content = file_content.encode('utf-8', 'ignore').decode()
            page_one.append(file_content[:200])  # The first 200 characters serve as the summary
            final_result.append(file_content)
            file_manifest.append(os.path.relpath(fp, folder_word))

    return True, final_result, page_one, file_manifest, exception
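
A caller has to unpack the five return values and branch on the status string in the last position. The sketch below is not part of this file and is only one possible way to do that. `FAILURE_HINTS`, `explain_failure`, `summarize_result`, and all message wording are hypothetical. Only the three codes `"word"`, `"pdf"`, and `"word_pip"` come from the function above. The pip package names assume the usual distributions (PyMuPDF provides `fitz`, python-docx provides `docx`).

```python
# Hypothetical helper (not in the original file): maps the status codes that
# extract_text_from_files returns as its fifth value to user-facing hints.
# The codes come from the source; the wording of every hint is an assumption.
FAILURE_HINTS = {
    "word": "Legacy .doc files are not parsed; convert them to .docx or .pdf first.",
    "pdf": "PDF parsing needs the fitz module; try `pip install --upgrade pymupdf`.",
    "word_pip": "DOCX parsing needs the docx module; try `pip install --upgrade python-docx`.",
}


def explain_failure(code):
    """Return a hint for a status code, or None when the code is empty."""
    if not code:
        return None
    return FAILURE_HINTS.get(code, f"Unrecognized status code: {code!r}")


def summarize_result(result):
    """Turn the 5-tuple into a short report (illustrative only)."""
    found, texts, summaries, manifest, code = result
    hint = explain_failure(code)
    if hint is not None:
        return hint
    if not found:
        # No file was detected, so the first element holds the raw input text.
        return f"No files found; treating input as plain text: {texts[0]!r}"
    # One line per extracted file: relative path and length of its text.
    lines = [f"{path}: {len(text)} characters" for path, text in zip(manifest, texts)]
    return "\n".join(lines)
```

In a plugin this could be called as `summarize_result(extract_text_from_files(txt, chatbot, history))`. The status code is checked before the boolean because a missing dependency also returns `False`, and the lists may then already hold text from files parsed earlier.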