GitHub - Hanpx20/Anchor

Files:
0.txt
1.all.tsv
1.txt
10.txt
11.txt
2.txt
3.txt
4.txt
5.all.tsv
5.txt
6.txt
7.all.tsv
7.txt
8.txt
9.txt
README.txt

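I sampled 200 queries from each group and manually inspected them to judge which domain each group belongs to.
I was less sure about groups 1, 5, and 7 (marked with * below), so I have uploaded the complete set of queries for those groups.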

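0: Medicine, health
1*: Politics, biology
2: Electronic products
3: Food, recipes
4: People
5*: History
6: Product advertisements
7*: Travel
8: News
9: Companies, organizations
10: Universities
11: People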

In the training runs so far, the distribution of Group Weight is roughly as follows (one row per run):
0       1       2       3       4       5       6       7       8       9       10      11
0.0936  0.1970  0.0949  0.0674  0.0498  0.0668  0.0440  0.0368  0.1897  0.0393  0.0680  0.0527
0.0863  0.1515  0.0952  0.0674  0.0656  0.0742  0.0543  0.0506  0.1581  0.0525  0.0805  0.0638
0.0413  0.4236  0.0496  0.0240  0.0155  0.0237  0.0095  0.0044  0.3715  0.0036  0.0178  0.0154  (this run did not use a scheduler)

Groups ordered from smallest to largest weight (one row per run):
7, 9, 6, 4, 11, 5, 3, 10, 0, 2, 8, 1
7, 9, 6, 11, 4, 3, 5, 10, 0, 2, 1, 8
9, 7, 6, 11, 4, 10, 5, 3, 0, 2, 8, 1
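
The orderings above can be reproduced from the weight table with a short script. This is only a minimal sketch: the names `RUNS`, `ascending_order`, and `ORDERS` are made up for illustration and are not part of the repository.

```python
# Group weights copied from the three runs in the table (groups 0..11).
RUNS = [
    [0.0936, 0.1970, 0.0949, 0.0674, 0.0498, 0.0668,
     0.0440, 0.0368, 0.1897, 0.0393, 0.0680, 0.0527],
    [0.0863, 0.1515, 0.0952, 0.0674, 0.0656, 0.0742,
     0.0543, 0.0506, 0.1581, 0.0525, 0.0805, 0.0638],
    [0.0413, 0.4236, 0.0496, 0.0240, 0.0155, 0.0237,
     0.0095, 0.0044, 0.3715, 0.0036, 0.0178, 0.0154],
]


def ascending_order(weights):
    """Return group ids sorted from smallest to largest weight."""
    return sorted(range(len(weights)), key=lambda g: weights[g])


ORDERS = [ascending_order(w) for w in RUNS]
for order in ORDERS:
    print(", ".join(str(g) for g in order))
```

No two groups share the same weight within a run, so each ordering is unambiguous.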

Tier 0: 1, 8
Tier 1: 0, 2
Tier 2: 3, 5, 10
Tier 3: 4, 11
Tier 4: 6, 7, 9
One possible explanation is that topics with high complexity and few or no proper nouns receive higher weights.
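
One way to sanity-check the tier grouping is to confirm that, in every run, each group in a tier has a larger weight than every group in the next tier down. The sketch below does that; `TIERS`, `tiers_consistent`, and `CONSISTENT` are hypothetical names used only for this illustration, and the check is one possible reading of how the tiers were derived.

```python
# Group weights copied from the three runs in the table (groups 0..11).
RUNS = [
    [0.0936, 0.1970, 0.0949, 0.0674, 0.0498, 0.0668,
     0.0440, 0.0368, 0.1897, 0.0393, 0.0680, 0.0527],
    [0.0863, 0.1515, 0.0952, 0.0674, 0.0656, 0.0742,
     0.0543, 0.0506, 0.1581, 0.0525, 0.0805, 0.0638],
    [0.0413, 0.4236, 0.0496, 0.0240, 0.0155, 0.0237,
     0.0095, 0.0044, 0.3715, 0.0036, 0.0178, 0.0154],
]

# Tier -> group ids, as listed in the notes (Tier 0 has the largest weights).
TIERS = {0: [1, 8], 1: [0, 2], 2: [3, 5, 10], 3: [4, 11], 4: [6, 7, 9]}


def tiers_consistent(weights, tiers):
    """True if every group in a tier outweighs every group in the next tier."""
    ordered = [tiers[t] for t in sorted(tiers)]
    for upper, lower in zip(ordered, ordered[1:]):
        if min(weights[g] for g in upper) <= max(weights[g] for g in lower):
            return False
    return True


CONSISTENT = [tiers_consistent(w, TIERS) for w in RUNS]
print(CONSISTENT)
```

Checking adjacent tiers is enough, because "every group in tier k outweighs every group in tier k+1" chains across all tiers.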

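The current training truncates all groups to the same length. Without truncation, the number of samples in each group would be (unit: 10,000 samples):
0       1       2       3       4       5       6       7       8       9       10      11
34.2    49.2    27.9    30.2    18.1    50.2    43.4    56.1    32.8    39.2    27.8    33.4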
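
To get a feel for what truncation costs, the sketch below cuts every group to a common size. The notes do not say which common length is used, so truncating to the smallest group (group 4) is purely an assumption here, and `truncate_groups` is a hypothetical helper, not code from the repository.

```python
# Untruncated sample counts per group (groups 0..11), in units of 10,000.
COUNTS_WAN = [34.2, 49.2, 27.9, 30.2, 18.1, 50.2,
              43.4, 56.1, 32.8, 39.2, 27.8, 33.4]
WAN = 10_000

# round() avoids floating-point noise when converting to absolute counts.
SIZES = [round(c * WAN) for c in COUNTS_WAN]


def truncate_groups(groups, target=None):
    """Cut every group (a list of samples) down to `target` samples.

    Assumption: when no target is given, use the smallest group's size.
    """
    if target is None:
        target = min(len(g) for g in groups)
    return [g[:target] for g in groups]


TARGET = min(SIZES)          # size of the smallest group
KEPT = TARGET * len(SIZES)   # samples left if all groups are cut to TARGET
TOTAL = sum(SIZES)           # samples available without truncation
print(f"target per group: {TARGET}")
print(f"kept {KEPT} of {TOTAL} samples ({KEPT / TOTAL:.1%})")

# Tiny demonstration on toy groups of unequal size.
TOY = [list(range(n)) for n in (5, 3, 7)]
TRUNCATED = truncate_groups(TOY)
print([len(g) for g in TRUNCATED])
```

Under this assumption roughly half of the available samples would be discarded, with the largest groups (7, 5, and 1) losing the most.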