LaTeX formula bug due to MathJax #3

Open

wants to merge 49 commits into base: master

Changes from all commits
49 commits
95a3759
update
piglaker Jan 4, 2021
7c8914a
Update 文本分类.md
piglaker Jan 4, 2021
0e88a10
Update 文本分类.md
piglaker Jan 4, 2021
f498fb8
Update 文本分类.md
piglaker Jan 4, 2021
6734546
Update 文本分类.md
piglaker Jan 4, 2021
44ad684
Update 文本分类.md
piglaker Jan 4, 2021
5707bec
Update 文本分类.md
piglaker Jan 4, 2021
c3fe08b
Update 文本分类.md
piglaker Jan 4, 2021
00a0b43
Update 文本分类.md
piglaker Jan 4, 2021
eeddd5c
Update 文本分类.md
piglaker Jan 4, 2021
dd65673
Update 文本分类.md
piglaker Jan 4, 2021
2be1682
task1 not yet
piglaker Feb 19, 2021
dad66c8
Merge branch 'master' of https://github.com/piglaker/nlp-beginner
piglaker Feb 19, 2021
7217369
BOW 67% (unbalanced samples)
piglaker Feb 20, 2021
d2058e0
refresh
piglaker Feb 20, 2021
da0f4b6
refresh
piglaker Feb 20, 2021
8d18feb
refresh
piglaker Feb 20, 2021
259b2f9
2/21 cs224n assignment1
piglaker Feb 21, 2021
c06a754
Debugging and summary: refactored the code and removed some features; will rewrite with torch later
piglaker Feb 22, 2021
d68380a
Thoughts
piglaker Feb 22, 2021
8b8cb40
rewrite softmax_torch and test... even with NeuralNetwork...
piglaker Feb 22, 2021
cad88f0
task1 done!
piglaker Feb 22, 2021
64d057b
task1 over
piglaker Feb 22, 2021
1806dab
task1 over
piglaker Feb 22, 2021
3e6db1e
start
piglaker Feb 26, 2021
2b3c54a
buzhongyao (pinyin, "not important")
piglaker Feb 28, 2021
02f5367
2021/3/4
piglaker Mar 4, 2021
31293a1
data...
piglaker Mar 8, 2021
f9a02a7
Monday, March 8, 2021, 13:03:07 CST
piglaker Mar 8, 2021
c31a520
Tuesday, March 9, 2021, 10:26:53 CST
piglaker Mar 9, 2021
972e7a7
Tuesday, March 9, 2021, 10:35:20 CST
piglaker Mar 9, 2021
f9d240d
Tuesday, March 9, 2021, 10:58:03 CST
piglaker Mar 9, 2021
f9df2ab
Tuesday, March 9, 2021, 13:31:55 CST
piglaker Mar 9, 2021
f8d6fd9
task2 LSTM Done!
piglaker Mar 9, 2021
c9e5a91
RNN Done!
piglaker Mar 11, 2021
54e3236
stole some time (2 h) and finished TextCNN, so easy~
piglaker Mar 17, 2021
7424adb
task5 not test yet
piglaker Mar 18, 2021
05732f4
task5 almost done !
piglaker Mar 19, 2021
ed137c0
task5 done
piglaker Mar 19, 2021
8be20e6
esim not finished yet
piglaker Mar 19, 2021
7d3c053
esim not finished yet
piglaker Mar 19, 2021
f42650b
esim not finished yet
piglaker Mar 19, 2021
524ce9c
task3 esim finished ...
piglaker Mar 20, 2021
acaa00c
morning
piglaker Mar 21, 2021
0a24e02
task3 done!
piglaker Mar 21, 2021
801d8fc
ctf not finished
piglaker Mar 22, 2021
e2b95e2
task4 done!
piglaker Mar 22, 2021
ae63821
task4 done!
piglaker Mar 22, 2021
5ff8100
mod
piglaker May 23, 2021
26 changes: 25 additions & 1 deletion README.md
@@ -33,6 +33,27 @@
2. shuffle, batch, mini-batch
6. Time: two weeks

## Report
date: 2021-2-19 to 2021-2-20
1. reference:
None
2. dataset:
train.tsv / test.tsv
3. lib:
numpy; pandas; matplotlib
4. feature:
BOW; Ngram (high level 19422/136648)
5. details:
softmax; argmax; shuffle; batch; iteration
6. result:
2000: epoch 10, acc 0.67
20000: epoch 10, acc 0.55 (loss no longer decreases as epochs grow)
7. conclusion:
problem 1: I think softmax + fc only learns the label distribution of the target dataset. Checking with dataloader.check_dataset() shows the dataset is extremely imbalanced, and the trained softmax only scores slightly above the largest class. In my view, the limit of what 19422*5 parameters can do on a complex language problem is to guess the target distribution plus a little memorization (5%). After tuning hyperparameters for a long time, neither my implementation nor pytorch produced a converging loss curve, and after trying a balanced dataset the accuracy reached at most about 0.22. To be rigorous, I will look at others' results tonight and rewrite this part entirely in pytorch to verify my view; of course, the problem could also be mine.
8. thinking:
An earlier small bug: I had set the BOW dimension too high. After fixing it and running with pytorch, I can reproduce others' 0.8 train / 0.5 test accuracy, but it feels meaningless: training on a balanced dataset built from my own dataloader just overfits, with high train and low test accuracy. Task 1 ends here; the result is very limited, the model learns little and is mostly rote memorization. Ngrams are implemented but not run; not much value there.
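A minimal sketch of the imbalance check argued in the conclusion above. The "Sentiment" column name is an assumption (the standard SST Kaggle split); dataloader.check_dataset() itself is not part of this PR, so this only approximates what such a helper would report.

import pandas as pd

# Load the SST-style training split (tab-separated, as in train.tsv above).
train = pd.read_csv("train.tsv", sep="\t")

# Per-class share of the training labels -- the distribution the report
# describes as extremely imbalanced.
counts = train["Sentiment"].value_counts(normalize=True).sort_index()
print(counts)

# Majority-class baseline: always predicting the most frequent label already
# yields this accuracy, the natural comparison point for the claim that
# softmax + fc mostly learns the label prior.
print("majority baseline acc: %.3f" % counts.max())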


### Task 2: Deep-Learning-Based Text Classification

Get familiar with Pytorch; rewrite Task 1 with Pytorch and implement CNN- and RNN-based text classification;
@@ -94,4 +115,7 @@
4. Knowledge points:
1. Language models: perplexity, etc. (see the formula sketch below)
2. Text generation
5. Time: two weeks
5. Time: two weeks


##### Done
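A sketch of the perplexity definition referenced in the knowledge points above (this formula is not part of the original README; it is the standard exponentiated average negative log-likelihood over a held-out sequence):

$$\mathrm{PPL}(w_{1:N}) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(w_i \mid w_1, \dots, w_{i-1}\right)\right)$$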
Binary file added a1/.DS_Store
Binary file not shown.
968 changes: 968 additions & 0 deletions a1/.ipynb_checkpoints/exploring_word_vectors-checkpoint.ipynb

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions a1/.vscode/settings.json
@@ -0,0 +1,4 @@
{
    "jupyter.jupyterServerType": "local",
    "python.pythonPath": "D:\\anaconda3.5.0\\envs\\cs224n\\python.exe"
}
28 changes: 28 additions & 0 deletions a1/README.txt
@@ -0,0 +1,28 @@
Welcome to CS224N!

We'll be using Python throughout the course. If you've got a good Python setup already, great! But make sure that it is at least Python version 3.5. If not, the easiest thing to do is to make sure you have at least 3GB free on your computer and then to head over to (https://www.anaconda.com/download/) and install the Python 3 version of Anaconda. It will work on any operating system.

After you have installed conda, close any open terminals you might have. Then open a new terminal and run the following command:

# 1. Create an environment with dependencies specified in env.yml:

conda env create -f env.yml

# 2. Activate the new environment:

conda activate cs224n

# 3. Inside the new environment, install the IPython kernel so we can use this environment in jupyter notebook:

python -m ipykernel install --user --name cs224n


# 4. Homework 1 (only) is a Jupyter Notebook. With the above done you should be able to get underway by typing:

jupyter notebook exploring_word_vectors.ipynb

# 5. To make sure we are using the right environment, go to the toolbar of exploring_word_vectors.ipynb, click on Kernel -> Change kernel, you should see and select cs224n in the drop-down menu.

# To deactivate an active environment, use

conda deactivate
14 changes: 14 additions & 0 deletions a1/env.yml
@@ -0,0 +1,14 @@
name: cs224n
channels:
- defaults
- anaconda
dependencies:
- jupyter
- matplotlib
- numpy
- python=3.7
- ipykernel
- scikit-learn
- nltk
- gensim

1,328 changes: 1,328 additions & 0 deletions a1/exploring_word_vectors.ipynb

Large diffs are not rendered by default.

Binary file added a1/imgs/inner_product.png
Binary file added a1/imgs/svd.png
Binary file added a1/imgs/test_plot.png
Binary file added a2/.DS_Store
Binary file not shown.
2 changes: 2 additions & 0 deletions a2/collect_submission.sh
@@ -0,0 +1,2 @@
rm -f assignment2.zip
zip -r assignment2.zip *.py *.png saved_params_40000.npy
10 changes: 10 additions & 0 deletions a2/env.yml
@@ -0,0 +1,10 @@
name: a2
channels:
- defaults
- anaconda
dependencies:
- jupyter
- matplotlib
- numpy
- python=3.7
- scikit-learn
15 changes: 15 additions & 0 deletions a2/get_datasets.sh
@@ -0,0 +1,15 @@
#!/bin/bash

DATASETS_DIR="utils/datasets"
mkdir -p $DATASETS_DIR

cd $DATASETS_DIR

# Get Stanford Sentiment Treebank
if hash wget 2>/dev/null; then
    wget http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip
else
    curl -L http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip -o stanfordSentimentTreebank.zip
fi
unzip stanfordSentimentTreebank.zip
rm stanfordSentimentTreebank.zip
75 changes: 75 additions & 0 deletions a2/run.py
@@ -0,0 +1,75 @@
#!/usr/bin/env python

import random
import numpy as np
from utils.treebank import StanfordSentiment
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import time

from word2vec import *
from sgd import *

# Check Python Version
import sys
assert sys.version_info[0] == 3
assert sys.version_info[1] >= 5

# Reset the random seed to make sure that everyone gets the same results
random.seed(314)
dataset = StanfordSentiment()
tokens = dataset.tokens()
nWords = len(tokens)

# We are going to train 10-dimensional vectors for this assignment
dimVectors = 10

# Context size
C = 5

# Reset the random seed to make sure that everyone gets the same results
random.seed(31415)
np.random.seed(9265)

startTime=time.time()
wordVectors = np.concatenate(
    ((np.random.rand(nWords, dimVectors) - 0.5) /
     dimVectors, np.zeros((nWords, dimVectors))),
    axis=0)
wordVectors = sgd(
    lambda vec: word2vec_sgd_wrapper(skipgram, tokens, vec, dataset, C,
                                     negSamplingLossAndGradient),
    wordVectors, 0.3, 40000, None, True, PRINT_EVERY=10)
# Note that normalization is not called here. This is not a bug;
# normalizing during training loses the notion of length.

print("sanity check: cost at convergence should be around or below 10")
print("training took %d seconds" % (time.time() - startTime))

# concatenate the input and output word vectors
wordVectors = np.concatenate(
    (wordVectors[:nWords,:], wordVectors[nWords:,:]),
    axis=0)

visualizeWords = [
    "great", "cool", "brilliant", "wonderful", "well", "amazing",
    "worth", "sweet", "enjoyable", "boring", "bad", "dumb",
    "annoying", "female", "male", "queen", "king", "man", "woman", "rain", "snow",
    "hail", "coffee", "tea"]

visualizeIdx = [tokens[word] for word in visualizeWords]
visualizeVecs = wordVectors[visualizeIdx, :]
temp = (visualizeVecs - np.mean(visualizeVecs, axis=0))
covariance = 1.0 / len(visualizeIdx) * temp.T.dot(temp)
U,S,V = np.linalg.svd(covariance)
coord = temp.dot(U[:,0:2])

for i in range(len(visualizeWords)):
    plt.text(coord[i,0], coord[i,1], visualizeWords[i],
             bbox=dict(facecolor='green', alpha=0.1))

plt.xlim((np.min(coord[:,0]), np.max(coord[:,0])))
plt.ylim((np.min(coord[:,1]), np.max(coord[:,1])))

plt.savefig('word_vectors.png')
131 changes: 131 additions & 0 deletions a2/sgd.py
@@ -0,0 +1,131 @@
#!/usr/bin/env python

# Save parameters every few SGD iterations as a fail-safe
SAVE_PARAMS_EVERY = 5000

import pickle
import glob
import random
import numpy as np
import os.path as op

def load_saved_params():
    """
    A helper function that loads previously saved parameters and resets
    iteration start.
    """
    st = 0
    for f in glob.glob("saved_params_*.npy"):
        iter = int(op.splitext(op.basename(f))[0].split("_")[2])
        if (iter > st):
            st = iter

    if st > 0:
        params_file = "saved_params_%d.npy" % st
        state_file = "saved_state_%d.pickle" % st
        params = np.load(params_file)
        with open(state_file, "rb") as f:
            state = pickle.load(f)
        return st, params, state
    else:
        return st, None, None


def save_params(iter, params):
    params_file = "saved_params_%d.npy" % iter
    np.save(params_file, params)
    with open("saved_state_%d.pickle" % iter, "wb") as f:
        pickle.dump(random.getstate(), f)


def sgd(f, x0, step, iterations, postprocessing=None, useSaved=False,
        PRINT_EVERY=10):
    """ Stochastic Gradient Descent

    Implement the stochastic gradient descent method in this function.

    Arguments:
    f -- the function to optimize, it should take a single
         argument and yield two outputs, a loss and the gradient
         with respect to the arguments
    x0 -- the initial point to start SGD from
    step -- the step size for SGD
    iterations -- total iterations to run SGD for
    postprocessing -- postprocessing function for the parameters
         if necessary. In the case of word2vec we will need to
         normalize the word vectors to have unit length.
    PRINT_EVERY -- specifies how many iterations to output loss

    Return:
    x -- the parameter value after SGD finishes
    """

    # Anneal learning rate every several iterations
    ANNEAL_EVERY = 20000

    if useSaved:
        start_iter, oldx, state = load_saved_params()
        if start_iter > 0:
            x0 = oldx
            step *= 0.5 ** (start_iter / ANNEAL_EVERY)

        if state:
            random.setstate(state)
    else:
        start_iter = 0

    x = x0

    if not postprocessing:
        postprocessing = lambda x: x

    exploss = None

    for iter in range(start_iter + 1, iterations + 1):
        # You might want to print the progress every few iterations.

        loss = None
        ### YOUR CODE HERE (~2 lines)
        # Assumed standard fill-in (the stub was left blank in this commit):
        # evaluate loss and gradient at x, then take a gradient step.
        loss, grad = f(x)
        x -= step * grad
        ### END YOUR CODE

        x = postprocessing(x)
        if iter % PRINT_EVERY == 0:
            if not exploss:
                exploss = loss
            else:
                exploss = .95 * exploss + .05 * loss
            print("iter %d: %f" % (iter, exploss))

        if iter % SAVE_PARAMS_EVERY == 0 and useSaved:
            save_params(iter, x)

        if iter % ANNEAL_EVERY == 0:
            step *= 0.5

    return x


def sanity_check():
    quad = lambda x: (np.sum(x ** 2), x * 2)

    print("Running sanity checks...")
    t1 = sgd(quad, 0.5, 0.01, 1000, PRINT_EVERY=100)
    print("test 1 result:", t1)
    assert abs(t1) <= 1e-6

    t2 = sgd(quad, 0.0, 0.01, 1000, PRINT_EVERY=100)
    print("test 2 result:", t2)
    assert abs(t2) <= 1e-6

    t3 = sgd(quad, -1.5, 0.01, 1000, PRINT_EVERY=100)
    print("test 3 result:", t3)
    assert abs(t3) <= 1e-6

    print("-" * 40)
    print("ALL TESTS PASSED")
    print("-" * 40)


if __name__ == "__main__":
    sanity_check()
Empty file added a2/utils/__init__.py
Empty file.
Binary file added a2/utils/__pycache__/__init__.cpython-37.pyc
Binary file not shown.
Binary file added a2/utils/__pycache__/gradcheck.cpython-37.pyc
Binary file not shown.
Binary file added a2/utils/__pycache__/utils.cpython-37.pyc
Binary file not shown.