v4.6-4 (#473)
c121914yu authored Nov 15, 2023
1 parent bfd8be5 commit cd3acb4
Showing 39 changed files with 453 additions and 156 deletions.
Binary file added docSite/assets/imgs/datasetSetting1.png
5 changes: 3 additions & 2 deletions docSite/content/docs/installation/upgrading/46.md
@@ -50,5 +50,6 @@ curl --location --request POST 'https://{{host}}/api/admin/initv46-2' \
1. New - Team workspaces
2. New - Multi-path vectors (multiple vectors mapped to one data record)
3. New - TTS voice
4. New in the hosted version - ReRank vector recall for higher recall precision
5. Improved - dataset export now triggers a streaming download directly, with no more waiting on a spinner
4. New - datasets can be configured with a text preprocessing model
5. New in the hosted version - ReRank vector recall for higher recall precision
6. Improved - dataset export now triggers a streaming download directly, with no more waiting on a spinner
8 changes: 4 additions & 4 deletions docSite/content/docs/pricing.md
@@ -1,10 +1,10 @@
---
title: 'Pricing'
description: 'FastGPT pricing'
title: 'Hosted Version Pricing'
description: 'Pricing for the hosted version of FastGPT'
icon: 'currency_yen'
draft: false
toc: true
weight: 10
weight: 11
---

## About Tokens
@@ -15,7 +15,7 @@ weight: 10

## FastGPT Hosted Billing

Currently, hosted FastGPT bills solely by the number of tokens used. The detailed pricing table is below (the latest prices are shown in the online table and can be fetched in real time by clicking Recharge):
On the hosted sites, [https://fastgpt.run](https://fastgpt.run) and [https://ai.fastgpt.in](https://ai.fastgpt.in), you are charged solely by the number of tokens used. You can check your usage under Account - Usage Records. The detailed pricing table is below (the latest prices are shown in the online table and can be fetched in real time by clicking Recharge):

{{< table "table-hover table-striped-columns" >}}
| Billing item | Price: CNY per 1K tokens (context included) |
20 changes: 14 additions & 6 deletions docSite/content/docs/use-cases/datasetEngine.md
@@ -1,6 +1,6 @@
---
title: "知识库结构讲解"
description: "本节会介绍 FastGPT 知识库结构设计,理解其 QA 的存储格式和检索格式,以便更好的构建知识库。这篇介绍主要以使用为主,详细原理不多介绍。"
description: "本节会详细介绍 FastGPT 知识库结构设计,理解其 QA 的存储格式和多向量映射,以便更好的构建知识库。这篇介绍主要以使用为主,详细原理不多介绍。"
icon: "dataset"
draft: false
toc: true
@@ -25,13 +25,21 @@ FastGPT builds its datasets with the Embedding approach from RAG; to make good use of Fast

FastGPT uses the `PG Vector` plugin for `PostgreSQL` as its vector retriever, with an `HNSW` index. `PostgreSQL` is used only for vector retrieval; `MongoDB` stores and serves all other data.

In the `PostgreSQL` table, an `index` field stores the vector, a `q` field stores the content that vector corresponds to, and an `a` field serves as a retrieval mapping. The names `q` and `a` are historical; they need not be read strictly as a question-answer pair. In practice, the combination of `q` and `a` can add further framing to retrieved content and improve the LLM's comprehension (note: this does not directly improve search precision).
In the `PostgreSQL` table, an `index` field stores the vector, and a `data_id` field locates the corresponding mapped record in `MongoDB`. Several `index` rows can point at one `data_id`; in other words, one record of data can be covered by multiple vectors. At retrieval time, hits on the same record are merged.

Currently, there are several main ways to improve vector search precision:
![](/imgs/datasetSetting1.png)
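
To make the storage layout concrete, here is a hedged sketch of one row in the PG table, inferred from the columns read and written by the migration scripts later in this commit; the vector column's name and type are assumptions.

```typescript
// Hedged sketch of one PG row; columns inferred from the initv46-2/initv46-3
// migrations in this commit. The vector column name/type is an assumption.
type PgDatasetRowSketch = {
  id: string; // PG-side row id; written back to MongoDB as indexes[].dataId
  vector: number[]; // the embedding itself (what the doc calls the `index` field)
  data_id: string; // _id of the mapped record in MongoDB (dataset.datas)
  dataset_id: string;
  collection_id: string;
  team_id: string;
  tmb_id: string;
};
```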

1. Trim the content of `q` to shorten what gets vectorized: when `q` is shorter and more precise, retrieval precision naturally rises. The trade-off is a narrower retrieval range, which suits scenarios with strictly defined answers.
2. Split and segment better: when a passage is structurally and semantically complete, and covers a single topic, precision also improves. Many systems therefore tune their splitters to keep each record as intact as possible.
3. Diversify the text: adding keywords, a summary, similar questions, and other descriptive information to a piece of content gives its vector wider retrieval coverage.
## Purpose and Usage of Multiple Vectors

If we want a record to be as long as possible while its semantics are still fully expressed in its vectors, a single vector cannot represent it. We therefore use multi-vector mapping: one record is mapped to several vectors, preserving both the completeness of the data and the expression of its semantics.

You can attach several vectors to one long piece of text; at retrieval time, if any one of those vectors is hit, the whole record is recalled.
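
A hedged sketch of that recall path, written in the style of the `PgClient` calls elsewhere in this commit: nearest-neighbour hits are deduplicated by `data_id` before the full records are loaded from MongoDB. The pgvector operator and the vector column name are assumptions.

```typescript
import { PgClient } from '@fastgpt/service/common/pg';
import { PgDatasetTableName } from '@fastgpt/global/core/dataset/constant';
import { MongoDatasetData } from '@fastgpt/service/core/dataset/data/schema';

// Sketch only: the `vector` column and `<#>` (pgvector inner product) are assumptions.
async function searchSketch(queryVector: number[], limit = 10) {
  const { rows } = await PgClient.query<{ data_id: string }>(
    `SELECT data_id FROM ${PgDatasetTableName}
     ORDER BY vector <#> '[${queryVector.join(',')}]' LIMIT ${limit};`
  );
  // Several vectors may point at the same record; merge hits by data_id.
  const dataIds = [...new Set(rows.map((r) => r.data_id))];
  return MongoDatasetData.find({ _id: { $in: dataIds } });
}
```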

## Ways to Improve Vector Search Precision

1. Split and segment better: when a passage is structurally and semantically complete, and covers a single topic, precision improves. Many systems therefore tune their splitters to keep each record as intact as possible.
2. Trim the content of each `index` to shorten what gets vectorized: when an `index` is shorter and more precise, retrieval precision naturally rises. The trade-off is a narrower retrieval range, which suits scenarios with strictly defined answers.
3. Add more `index` entries: the same `chunk` can carry several `index` entries (see the sketch after this list).
4. Optimize the query: in practice, user questions are often vague or incomplete rather than clear, well-formed questions, so rewriting the user's question (the query) can also improve precision considerably.
5. Fine-tune the embedding model: off-the-shelf embedding models are general-purpose and not especially precise in specific domains, so fine-tuning one can substantially improve retrieval in specialized fields.
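
As an illustration of items 2 and 3, a record carrying several `index` entries over one chunk might look like the hedged sketch below, reusing the `indexes` shape written by the initv46-3 migration later in this commit; the texts and ids are invented.

```typescript
import { DatasetDataIndexTypeEnum } from '@fastgpt/global/core/dataset/constant';

// Invented example: one chunk, three vectors. A hit on any one of them
// recalls the whole record at search time.
const record = {
  q: 'A long passage explaining how FastGPT stores vectors in PostgreSQL ...',
  a: '',
  indexes: [
    {
      defaultIndex: true,
      type: DatasetDataIndexTypeEnum.chunk,
      dataId: 'pg-row-id-1', // id of the PG row holding this vector
      text: 'A long passage explaining how FastGPT stores vectors in PostgreSQL ...'
    },
    {
      defaultIndex: false,
      type: DatasetDataIndexTypeEnum.chunk,
      dataId: 'pg-row-id-2',
      text: 'Summary: where FastGPT keeps vectors and why.' // shorter, more precise index
    },
    {
      defaultIndex: false,
      type: DatasetDataIndexTypeEnum.qa,
      dataId: 'pg-row-id-3',
      text: 'Where does FastGPT store its vectors?' // a similar-question index
    }
  ]
};
```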

4 changes: 2 additions & 2 deletions packages/global/common/string/textSplitter.ts
@@ -63,8 +63,8 @@ export const splitText2Chunks = (props: { text: string; maxLen: number; overlapL
let chunks: string[] = [];
for (let i = 0; i < splitTexts.length; i++) {
let text = splitTexts[i];
let chunkToken = countPromptTokens(lastChunk, '');
const textToken = countPromptTokens(text, '');
let chunkToken = lastChunk.length;
const textToken = text.length;

// next chunk is too large / new chunk is too large(The current chunk must be smaller than maxLen)
if (textToken >= maxLen || chunkToken + textToken > maxLen * 1.4) {
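
For context, a hedged usage sketch of the splitter after this change: `maxLen` now budgets characters rather than model tokens. The full overlap parameter name and the return shape are assumptions.

```typescript
import { splitText2Chunks } from '@fastgpt/global/common/string/textSplitter';

// After this change, chunk sizes are measured in characters, not tokens:
// a chunk closes once adding the next piece would exceed maxLen * 1.4.
const chunks = splitText2Chunks({
  text: 'First paragraph.\n\nSecond paragraph.\n\nThird paragraph.',
  maxLen: 500,
  overlapLen: 50 // assumed full name of the truncated `overlapL...` prop above
});
console.log(chunks); // assumed to contain the resulting chunk strings
```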
6 changes: 4 additions & 2 deletions packages/global/core/dataset/type.d.ts
@@ -1,4 +1,4 @@
import type { VectorModelItemType } from '../../core/ai/model.d';
import type { LLMModelItemType, VectorModelItemType } from '../../core/ai/model.d';
import { PermissionTypeEnum } from '../../support/permission/constant';
import { PushDatasetDataChunkProps } from './api';
import {
@@ -19,6 +19,7 @@ export type DatasetSchemaType = {
avatar: string;
name: string;
vectorModel: string;
agentModel: string;
tags: string[];
type: `${DatasetTypeEnum}`;
permission: `${PermissionTypeEnum}`;
@@ -84,8 +85,9 @@ export type CollectionWithDatasetType = Omit<DatasetCollectionSchemaType, 'datas
};

/* ================= dataset ===================== */
export type DatasetItemType = Omit<DatasetSchemaType, 'vectorModel'> & {
export type DatasetItemType = Omit<DatasetSchemaType, 'vectorModel' | 'agentModel'> & {
vectorModel: VectorModelItemType;
agentModel: LLMModelItemType;
isOwner: boolean;
canWrite: boolean;
};
2 changes: 2 additions & 0 deletions packages/global/support/wallet/bill/api.d.ts
@@ -3,6 +3,8 @@ import { BillListItemType } from './type';

export type CreateTrainingBillProps = {
name: string;
vectorModel?: string;
agentModel?: string;
};

export type ConcatBillProps = {
1 change: 0 additions & 1 deletion packages/service/core/app/schema.ts
@@ -61,7 +61,6 @@ const AppSchema = new Schema({

try {
AppSchema.index({ updateTime: -1 });
AppSchema.index({ 'share.collection': -1 });
} catch (error) {
console.log(error);
}
1 change: 0 additions & 1 deletion packages/service/core/dataset/collection/schema.ts
@@ -69,7 +69,6 @@ const DatasetCollectionSchema = new Schema({

try {
DatasetCollectionSchema.index({ datasetId: 1 });
DatasetCollectionSchema.index({ userId: 1 });
DatasetCollectionSchema.index({ updateTime: -1 });
} catch (error) {
console.log(error);
5 changes: 5 additions & 0 deletions packages/service/core/dataset/schema.ts
@@ -48,6 +48,11 @@ const DatasetSchema = new Schema({
required: true,
default: 'text-embedding-ada-002'
},
agentModel: {
type: String,
required: true,
default: 'gpt-3.5-turbo-16k'
},
type: {
type: String,
enum: Object.keys(DatasetTypeMap),
2 changes: 1 addition & 1 deletion packages/service/core/dataset/training/schema.ts
@@ -95,7 +95,7 @@ const TrainingDataSchema = new Schema({

try {
TrainingDataSchema.index({ lockTime: 1 });
TrainingDataSchema.index({ userId: 1 });
TrainingDataSchema.index({ datasetId: 1 });
TrainingDataSchema.index({ collectionId: 1 });
TrainingDataSchema.index({ expireAt: 1 }, { expireAfterSeconds: 7 * 24 * 60 });
} catch (error) {
2 changes: 2 additions & 0 deletions projects/app/public/locales/en/common.json
@@ -250,6 +250,7 @@
}
},
"dataset": {
"Agent Model": "Learning Model",
"Chunk Length": "Chunk Length",
"Confirm move the folder": "Confirm Move",
"Confirm to delete the data": "Confirm to delete the data?",
@@ -259,6 +260,7 @@
"Delete Dataset Error": "Delete dataset failed",
"Edit Folder": "Edit Folder",
"Export": "Export",
"Export Dataset Limit Error": "Export Data Error",
"File Input": "Import File",
"File Size": "File Size",
"Filename": "Filename",
2 changes: 2 additions & 0 deletions projects/app/public/locales/zh/common.json
@@ -250,6 +250,7 @@
}
},
"dataset": {
"Agent Model": "文件处理模型",
"Chunk Length": "数据总量",
"Confirm move the folder": "确认移动到该目录",
"Confirm to delete the data": "确认删除该数据?",
@@ -259,6 +260,7 @@
"Delete Dataset Error": "删除知识库异常",
"Edit Folder": "编辑文件夹",
"Export": "导出",
"Export Dataset Limit Error": "导出数据失败",
"File Input": "文件导入",
"File Size": "文件大小",
"Filename": "文件名",
13 changes: 5 additions & 8 deletions projects/app/src/constants/dataset.ts
@@ -1,3 +1,4 @@
import { defaultQAModels, defaultVectorModels } from '@fastgpt/global/core/ai/model';
import type {
DatasetCollectionItemType,
DatasetItemType
@@ -17,13 +18,8 @@ export const defaultDatasetDetail: DatasetItemType = {
permission: 'private',
isOwner: false,
canWrite: false,
vectorModel: {
model: 'text-embedding-ada-002',
name: 'Embedding-2',
price: 0.2,
defaultToken: 500,
maxToken: 3000
}
vectorModel: defaultVectorModels[0],
agentModel: defaultQAModels[0]
};

export const defaultCollectionDetail: DatasetCollectionItemType = {
@@ -43,7 +39,8 @@ export const defaultCollectionDetail: DatasetCollectionItemType = {
name: '',
tags: [],
permission: 'private',
vectorModel: 'text-embedding-ada-002'
vectorModel: defaultVectorModels[0].model,
agentModel: defaultQAModels[0].model
},
parentId: '',
name: '',
2 changes: 2 additions & 0 deletions projects/app/src/global/core/api/datasetReq.d.ts
@@ -5,6 +5,7 @@ import type { SearchTestItemType } from '@/types/core/dataset';
import { UploadChunkItemType } from '@fastgpt/global/core/dataset/type';
import { DatasetCollectionSchemaType } from '@fastgpt/global/core/dataset/type';
import { PermissionTypeEnum } from '@fastgpt/global/support/permission/constant';
import type { LLMModelItemType } from '@fastgpt/global/core/ai/model.d';

/* ===== dataset ===== */
export type DatasetUpdateParams = {
@@ -14,6 +15,7 @@ export type DatasetUpdateParams = {
name?: string;
avatar?: string;
permission?: `${PermissionTypeEnum}`;
agentModel?: LLMModelItemType;
};

export type SearchTestProps = {
1 change: 1 addition & 0 deletions projects/app/src/global/core/dataset/api.d.ts
@@ -9,6 +9,7 @@ export type CreateDatasetParams = {
tags: string;
avatar: string;
vectorModel?: string;
agentModel?: string;
type: `${DatasetTypeEnum}`;
};
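
An illustrative payload for the extended type; the values are invented, and any field not visible in this hunk (such as `name`) is assumed from context.

```typescript
import type { CreateDatasetParams } from '@/global/core/dataset/api.d';

// Invented example: agentModel is optional, so older clients that omit it
// fall back to the schema default (gpt-3.5-turbo-16k, per dataset/schema.ts above).
const params: CreateDatasetParams = {
  name: 'Product docs', // assumed field, elided from this hunk
  tags: '',
  avatar: '/icon/logo.svg',
  vectorModel: 'text-embedding-ada-002',
  agentModel: 'gpt-3.5-turbo-16k',
  type: 'dataset' // assumed member of DatasetTypeEnum
};
```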

6 changes: 3 additions & 3 deletions projects/app/src/global/core/prompt/agent.ts
@@ -1,8 +1,8 @@
export const Prompt_AgentQA = {
prompt: `我会给你一段文本,{{theme}},学习它们,并整理学习成果,要求为:
1. 提出最多 25 个问题
2. 给出每个问题的答案
3. 答案要详细完整,答案可以包含普通文字、链接、代码、表格、公示、媒体链接等 markdown 元素
1. 提出问题并给出每个问题的答案
2. 每个答案都要详细完整,给出相关原文描述,答案可以包含普通文字、链接、代码、表格、公示、媒体链接等 markdown 元素
3. 最多提出 30 个问题
4. 按格式返回多个问题和答案:
Q1: 问题。
16 changes: 15 additions & 1 deletion projects/app/src/pages/api/admin/initv46-2.ts
@@ -11,6 +11,8 @@ import {
import { authCert } from '@fastgpt/service/support/permission/auth/common';
import { MongoDatasetData } from '@fastgpt/service/core/dataset/data/schema';
import { getUserDefaultTeam } from '@fastgpt/service/support/user/team/controller';
import { MongoDataset } from '@fastgpt/service/core/dataset/schema';
import { defaultQAModels } from '@fastgpt/global/core/ai/model';

let success = 0;
/* Move the data in pg into mongo dataset.datas and build the mapping */
@@ -41,6 +43,13 @@ export default async function handler(req: NextApiRequest, res: NextApiResponse)

await initPgData();

await MongoDataset.updateMany(
{},
{
agentModel: defaultQAModels[0].model
}
);

jsonRes(res, {
data: await init(limit),
message:
@@ -76,14 +85,19 @@ async function initPgData() {
for (let i = 0; i < limit; i++) {
init(i);
}

async function init(index: number): Promise<any> {
const userId = rows[index]?.user_id;
if (!userId) return;
try {
const tmb = await getUserDefaultTeam({ userId });
console.log(tmb);

// update pg
await PgClient.query(
`Update ${PgDatasetTableName} set team_id = '${tmb.teamId}', tmb_id = '${tmb.tmbId}' where user_id = '${userId}' AND team_id='null';`
`Update ${PgDatasetTableName} set team_id = '${String(tmb.teamId)}', tmb_id = '${String(
tmb.tmbId
)}' where user_id = '${userId}' AND team_id='null';`
);
console.log(++success);
init(index + limit);
101 changes: 101 additions & 0 deletions projects/app/src/pages/api/admin/initv46-3.ts
@@ -0,0 +1,101 @@
import type { NextApiRequest, NextApiResponse } from 'next';
import { jsonRes } from '@fastgpt/service/common/response';
import { connectToDatabase } from '@/service/mongo';
import { delay } from '@/utils/tools';
import { PgClient } from '@fastgpt/service/common/pg';
import {
DatasetDataIndexTypeEnum,
PgDatasetTableName
} from '@fastgpt/global/core/dataset/constant';

import { authCert } from '@fastgpt/service/support/permission/auth/common';
import { MongoDatasetData } from '@fastgpt/service/core/dataset/data/schema';

let success = 0;
/* Move the data in pg into mongo dataset.datas and build the mapping */
export default async function handler(req: NextApiRequest, res: NextApiResponse) {
try {
const { limit = 50 } = req.body as { limit: number };
await authCert({ req, authRoot: true });
await connectToDatabase();
success = 0;

jsonRes(res, {
data: await init(limit)
});
} catch (error) {
console.log(error);

jsonRes(res, {
code: 500,
error
});
}
}

type PgItemType = {
id: string;
q: string;
a: string;
dataset_id: string;
collection_id: string;
data_id: string;
};

async function init(limit: number): Promise<any> {
const { rows: idList } = await PgClient.query<{ id: string }>(
`SELECT id FROM ${PgDatasetTableName} WHERE inited=1`
);

console.log('totalCount', idList.length);

await delay(2000);

if (idList.length === 0) return;

for (let i = 0; i < limit; i++) {
initData(i);
}

async function initData(index: number): Promise<any> {
const dataId = idList[index]?.id;
if (!dataId) {
console.log('done');
return;
}
// get limit data where data_id is null
const { rows } = await PgClient.query<PgItemType>(
`SELECT id,q,a,dataset_id,collection_id,data_id FROM ${PgDatasetTableName} WHERE id=${dataId};`
);
const data = rows[0];
if (!data) {
console.log('done');
return;
}

try {
// update mongo data and update inited
await MongoDatasetData.findByIdAndUpdate(data.data_id, {
q: data.q,
a: data.a,
indexes: [
{
defaultIndex: !data.a,
type: data.a ? DatasetDataIndexTypeEnum.qa : DatasetDataIndexTypeEnum.chunk,
dataId: data.id,
text: data.q
}
]
});
// update pg data_id
await PgClient.query(`UPDATE ${PgDatasetTableName} SET inited=0 WHERE id=${dataId};`);

return initData(index + limit);
} catch (error) {
console.log(error);
console.log(data);
await delay(500);
return initData(index);
}
}
}
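
By analogy with the initv46-2 curl documented in 46.md at the top of this commit, the new migration would presumably be triggered like the hedged sketch below; the root-auth header name is an assumption carried over from that pattern.

```typescript
// Hedged sketch; {{host}} / {{rootkey}} are placeholders as in 46.md, and the
// `rootkey` header name is an assumption (the route uses authCert({ authRoot: true })).
const res = await fetch('https://{{host}}/api/admin/initv46-3', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', rootkey: '{{rootkey}}' },
  body: JSON.stringify({ limit: 50 }) // migration batch size, default 50
});
console.log(await res.json());
```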