Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redo the API video upload #1332

Open
Woseseltops opened this issue Sep 26, 2024 · 35 comments
Open

Redo the API video upload #1332

Woseseltops opened this issue Sep 26, 2024 · 35 comments
Assignees
Labels

Comments

@Woseseltops
Copy link
Collaborator

@rem0g suggested (in #1331) we redesign the video upload. I think the current workflow is:

  • Name video files after glosses, store them in a zip
  • Upload a zip file to a server somewhere and grab its url
  • Call /dictionary/upload_zipped_videos_folder_json/{datasetid}/ and provide this url
  • Call /dictionary/upload_videos_to_glosses/{datasetid} to actually link them to the glosses.

I want to replace this with just one step, to be run for each gloss invidually:

  • POST /dictionary/gloss/{gloss_id}/zippedvideo with just the zipped file in the payload.

I don't think uploading the videos one by one has any real drawbacks in terms of overload, and it does make the whole process easier to follow: you upload a video to a gloss, and if it fails somewhere in the process, it can just return what went wrong for that one gloss.

Also, @rem0g suggests to remove the h264 check:

it's not needed to check whether it's h264. If it doesnt play, it wont play and we will know that one way or another.`

@Woseseltops
Copy link
Collaborator Author

@rem0g , would this make your life better?

@susanodd , can you think of other drawbacks?

@susanodd
Copy link
Collaborator

susanodd commented Sep 26, 2024

I agree the check for h264 is excessive. It was more a check that the user didn't put say, PDF files in with videos.
But if we stick with "try-except" with friendly error reporting for when there was a problem, then this could be programmed away, so not to "check before" but just fail.

The zip uploads them to a different location because the normal upload of a video creates an object at the same time.

My opinion is that we should turn off reversion. (It will create a mess if it stays, because it does this every time you upload a video and then moves files around.)
I would just remove it entirely and just overwrite an existing video file.
Invest more time in saving the writable archive instead.

For the perspective videos, I made it so it just "overwrites" an existing 'left' or 'right' file and leaves the "GlossVideo object" alone. (So its link just points to a replaced file.)

@Woseseltops
Copy link
Collaborator Author

Summary of @susanodd : there are some extra things we should think about if we want to do this right.

@rem0g
Copy link
Collaborator

rem0g commented Sep 27, 2024

I would rather go this way:

  • Name video file after signbank ID (because of special chars of some glosses)
  • Upload the video file to server directly, I don't see good reason for zipping the file besides special chars.
  • Signbank will automatically bind the signbank video with gloss. There is no reason this could only be done with call, Python and Django are very well able to process this and it's alot easier to implement.
  • Signbank will provide status in gloss query about the status of video (processing,succeed (at which date). This can be checked periodically than continue to upload videos and see what happens with binding forever; thus reducing need for resources and latency.
  • History of uploaded videos, because there is no history what happens with video. We should be able to restore by accident deleted videos, this can happen without API and will be very welcomed by other users.

This would definitely make my life (and for other developers easier) as we are going to upload max 6 videos per glos + 1 animation file.

@susanodd
Copy link
Collaborator

susanodd commented Sep 28, 2024

  • History of uploaded videos, because there is no history what happens with video. We should be able to restore by accident deleted videos, this can happen without API and will be very welcomed by other users.

This information is already maintained and is in the admin. We can make this visible via the dataset manager page.
Operations related to this would need to be added.
About the accidentally deleted videos, that is more difficult.
Only the paths are stored in the objects, not the actual video. You can merely see what the operation was. Not undo a delete.
That the video files are not stored in the database is evidenced by there not being any videos on the development servers, unless uploaded on them.

That's where "reversion" comes in. Reversion adds extra ".bak" to the end of filenames. And it's ".bak.bak.bak" extensions depending on how many deletes. Or it's version #, in the object model with 0, 1, 2, ... and those all get modified whenever a new video file is uploaded. (The code is inconsistent in doing this. Some of this is automated.) I don't like the code that does this at all. Moreover, if the user keeps retrying to upload a video, it keeps making more of these files...

  • Upload the video file to server directly, I don't see good reason for zipping the file besides special chars.
  • Signbank will automatically bind the signbank video with gloss. There is no reason this could only be done with call, Python and Django are very well able to process this and it's alot easier to implement.

Where it is uploaded to is a problem. Objects need to be created. At the moment there is also "reversion" as described above.

There is also a problem with transactions versus file system operations. If the file system hasn't finished writing an upload before the object is saved something goes wrong.
If the server gets overloaded with lots of file system operations the transactions will fail.
Sqlite locks the entire database for atomic operations.

Django writes to temp storage first. All the operations work this way.
(Some other issues are having trouble with this.)

@susanodd
Copy link
Collaborator

susanodd commented Sep 28, 2024

[IMPLEMENTATION]

Maybe it would be possible to split the video upload into two parts, so the "gloss video object" and "gloss video history" objects are created, but only dummy things in the file system, so it can be done fast.

Then couple a cron job that puts all the actual files to replace the dummies.
There could be a toggle field in the object to indicate whether it was a dummy or if its contents had been filled in.

@susanodd
Copy link
Collaborator

susanodd commented Oct 1, 2024

In #1329 the bug seems to be with symbolic links.

I fixed this. I rewrote the code to delete first then upload, instead of replace.
There is a setting in default.py that turns off deleting video files.
This needs to be turned on otherwise the code cannot delete any files even if the video is deleted in the admin.
This remains the same for normal videos.
I made new branches for NME videos and perspective videos since the code needs to delete the objects and their videos.

Hopefully this issue can remove the reversion code.

@Woseseltops
Copy link
Collaborator Author

A few responses to @rem0g :

Upload the video file to server directly, I don't see good reason for zipping the file besides special chars.

For speed, but I wouldn't mind doing the raw video file instead.

Signbank will automatically bind the signbank video with gloss. There is no reason this could only be done with call, Python and Django are very well able to process this and it's alot easier to implement.

Can you rephrase this? I'm unsure what you mean exactly here... how is this different from my original proposal?

Signbank will provide status in gloss query about the status of video (processing,succeed (at which date). This can be checked periodically than continue to upload videos and see what happens with binding forever; thus reducing need for resources and latency.

I might be mistaken, but I think after the initial upload there are no other processing steps, now that the only check has been removed? That is, it's either uploaded to a gloss, or the upload failed?

History of uploaded videos, because there is no history what happens with video. We should be able to restore by accident deleted videos, this can happen without API and will be very welcomed by other users.

How about an endpoint GET dictionary/{gloss_id}/video_history, which returns a list of upload/delete dates for videos and ids to refer to them, and POST dictionary/{gloss_id}/restore_video with an id to restore a video uploaded earlier?

@rem0g
Copy link
Collaborator

rem0g commented Oct 7, 2024

A few responses to @rem0g :

Upload the video file to server directly, I don't see good reason for zipping the file besides special chars.

For speed, but I wouldn't mind doing the raw video file instead.

Videos are preprocessed already, but zipping video files slows down and compilates the process. Signbank has to download the provided URL and then unpack it, this can all be skipped by using simple video upload via POST form.

Signbank will automatically bind the signbank video with gloss. There is no reason this could only be done with call, Python and Django are very well able to process this and it's alot easier to implement.

Can you rephrase this? I'm unsure what you mean exactly here... how is this different from my original proposal?
We have this:
url = "https://signbank.cls.ru.nl/dictionary/upload_zipped_videos_folder_json
and then
url = "https://signbank.cls.ru.nl/dictionary/upload_videos_to_glosses/5/"

That is exactly what your proposal is about, but it should be:

url = "https://signbank.cls.ru.nl/dictionary/upload_video/5/" + POST form

then the video is uploaded and binded at once, becasue it retrieves video file in binary + associated post forms.

Signbank will provide status in gloss query about the status of video (processing,succeed (at which date). This can be checked periodically than continue to upload videos and see what happens with binding forever; thus reducing need for resources and latency.

I might be mistaken, but I think after the initial upload there are no other processing steps, now that the only check has been removed? That is, it's either uploaded to a gloss, or the upload failed?

SB does post status of other glosses in the response form when i upload entirely different glos video.

History of uploaded videos, because there is no history what happens with video. We should be able to restore by accident deleted videos, this can happen without API and will be very welcomed by other users.

How about an endpoint GET dictionary/{gloss_id}/video_history, which returns a list of upload/delete dates for videos and ids to refer to them, and POST dictionary/{gloss_id}/restore_video with an id to restore a video uploaded earlier?

Ok

@susanodd
Copy link
Collaborator

susanodd commented Oct 7, 2024

Django uploads files to a temporary storage location before they are put in the proper locations for the dataset/glosses.
There is always this step.

@susanodd
Copy link
Collaborator

susanodd commented Oct 8, 2024

Because reversion is being used, Django potentially moves a bunch of video files whenever a new video is uploaded. It also creates objects for each upload A video upload is never just one operation.

@susanodd
Copy link
Collaborator

susanodd commented Oct 8, 2024

@Woseseltops there are two branches/pull requests now with migrations on the underlying video models.
Can you checkout this code in conjunction with this issue? #1328, #1338

@susanodd
Copy link
Collaborator

susanodd commented Oct 8, 2024

For restoring videos from the video history a better backup system is needed.
We could do the same as for gloss delete/restore and add the archived field.
[NOPE, that still depends on the file system underneath.]

My opinion is that reversion needs to be removed. It makes a mess of stuff. Even if we try to program around it, it is still there sometimes. The video history has the operations but there are no videos in it. And if you keep doing operations they keep showing up in the history because it also has unsuccessful (historical) operations stored there. (That is something to do with the file system, because if something failed the objects have already been stored in the database, but perhaps not the file system, and vice verse. Django does not wait on the file system. If there is a solution to this, specific to Django?)
(In the ajax calls with videos, the "user html that has no Django in it" needs to "explicitly buffer the file" before it can do the PUSH request. Otherwise the file is empty on the Django side.)
That's why I did a zip file with multiple videos in the Manage Media page. Then the user also sees the contents of their zip file and sees the videos being processed. I set a limit to the "process next 10 videos" because the server could not keep up making objects for the videos!!!!! Those video files are already "on" the Signbank file system. It's just video objects that are being created.)
We also ought to avoid unnecessary burden of the server with extreme amounts of url requests. (opinion)
Avoid thrashing.

@Woseseltops
Copy link
Collaborator Author

There's some confusion around this issue; I'm still not 100% sure @rem0g and I are on the same page, @susanodd isn't sure about traffic... so in an attempt to remove this confusion, I'm going to build a first draft for this and see what everybody thinks.

@Woseseltops Woseseltops self-assigned this Oct 10, 2024
@rem0g
Copy link
Collaborator

rem0g commented Oct 10, 2024

That's fine, but it's our requirement to upload video via POST form instead of zipping it. I still fail to see any reason to zip the video before uploading it, sorry.

@susanodd
Copy link
Collaborator

susanodd commented Oct 11, 2024

That's fine, but it's our requirement to upload video via POST form instead of zipping it. I still fail to see any reason to zip the video before uploading it, sorry.

The zip allows the special characters to remain inside the zip because you can give a simple ascii-based filename.
Signbank has video files with Chinese characters in them, for example.

In #1341 you have glosses with special characters in them. We are working on locating the bugs.

@rem0g
Copy link
Collaborator

rem0g commented Oct 11, 2024 via email

@susanodd
Copy link
Collaborator

No, every video should be named with signbank id to prevent issues with gloss renaming. Gomer

Yes, of course. But the zip file itself is not the same as the video file inside that includes the annotation in the filename.

The intent of using a zip file is to shield the "information exchange over internet" from the special characters.
Some of the (browser) code escapes the urls by putting "%" codes in them. The browsers do this.
Inside the zip file, the actual video files can have any name.
But in the past, some really old original signbank video code actually used those "%" in the filenames, then it won't match the annotation, obviously. (@Woseseltops wrote a lot of the original code, so hopefully he can investigate.)

Of course, we still need to figure out what is wrong with #1341.

@susanodd
Copy link
Collaborator

susanodd commented Oct 11, 2024

There probably needs to be totally different new code for the API videos.
Some of the original code still is trying to convert to mp4. (A function ensure_mp4 still exists. But most of the called functions have commented this out because it was broken. The code to make small videos is also commented out. The code to generate a "still" for the video has also been commented out in some places because there was a bug in it if the "length" of the video ends up changing when it is converted to frames and resized.) All of this code is or is not necessarily called during video upload. @Jetske and @susanodd have introduced new code to try to circumvent some old code. But the old code has not been removed. Most of it is to do with the things mentioned.)

I report here all the buggy things that are messed up with videos.
It's not always clear when code is being run or which code is being run.

@rem0g
Copy link
Collaborator

rem0g commented Oct 11, 2024

I agree the video upload needs a complete rewrite, ideally based on signbank ID as video name. This also goes for NMM and Voorbeeldzinnen videos, they all need ID's.

That way we have a better idea of what's going on instead of investigating problems in old codes which needs more people.

@Woseseltops
Copy link
Collaborator Author

Woseseltops commented Oct 11, 2024

Okay I got the external part working, I can now upload video to a Signbank server like this:

# The URL where the video file will be sent
url = 'http://35.159.203.216/dictionary/gloss/12/video'

# The path to your video file
video_file_path = 'glossvideo_ASL_FO_FORCE-834.mov'

# Auth
headers = {'Authorization': f'Bearer {BEARER_TOKEN}'}

# Open the video file in binary mode and send it in a POST request
with open(video_file_path, 'rb') as video_file:
    files = {'file': video_file}
    response = requests.post(url, files=files, headers=headers)

# Print the server's response
print(response.status_code)
print(response.text)

The internal processing of the file will follow probably next week.

Let me know what I can do to make it more convenient for you @rem0g ! From what you write, I'm thinking perhaps a generic video upload end point, so you're not forced to put the numeric ID of the gloss in the url?

@susanodd
Copy link
Collaborator

susanodd commented Oct 11, 2024

@Woseseltops can you take a look at the other issues about the videos? #1331 #1341 #673

The problem is that the database gets locked from the traffic.

@susanodd
Copy link
Collaborator

susanodd commented Oct 12, 2024

No, every video should be named with signbank id to prevent issues with gloss renaming. Gomer

I will make this be an option for the existing zip archive so you can use merely the ID as the filename. (The Manage Media page also uses the same format for the zip archive.)

You ought to put like 20 or 50 videos in the zip file. Not just one video. This will reduce traffic to the server.

The code makes a streamed-json result to include all the results for all the videos that are in the same zip file.

Based on the ID, it is already possible to just lookup the gloss to retrieve its default annotation and its lemma (first two characters) in order to construct the path.
@ocrasborn came up with using the annotation (without the ID) because normally the (non-API) user is not using the IDs.

@susanodd
Copy link
Collaborator

susanodd commented Oct 12, 2024

I posted many excerpts from the logs in the other issues.
You can see that the database is getting locked because of the traffic.
My opinion is still the same that the zip archive (with many videos) is better because it's less traffic.

We can modify the existing zip file upload request to put the file inside as a POST instead of as a URL parameter.
(I need help with that POST part.)

susanodd pushed a commit that referenced this issue Oct 15, 2024
susanodd pushed a commit that referenced this issue Oct 15, 2024
susanodd pushed a commit that referenced this issue Oct 15, 2024
susanodd pushed a commit that referenced this issue Oct 15, 2024
susanodd pushed a commit that referenced this issue Oct 15, 2024
@Jetske
Copy link
Collaborator

Jetske commented Oct 16, 2024

@rem0g I added api creation with video upload of voorbeeldzinnen here: https://signbank.github.io/Global-signbank/#/Adding%20Signbank%20data/post_dictionary_api_create_annotated_sentence__datasetid__
Can you see if this way of uploading files works for you? I think it would be good to at least handle all kinds of api video uploads in a consistent way.

@susanodd
Copy link
Collaborator

susanodd commented Oct 16, 2024

What @Woseseltops proposes is that the API uses a file name is the path to the file destination in the Signbank file system, using underscore where slash is.

#1347 (there are actually bugs in creating folders if they don't exist yet)

susanodd pushed a commit that referenced this issue Oct 16, 2024
susanodd pushed a commit that referenced this issue Oct 16, 2024
susanodd pushed a commit that referenced this issue Oct 16, 2024
susanodd pushed a commit that referenced this issue Oct 16, 2024
susanodd pushed a commit that referenced this issue Oct 16, 2024
susanodd pushed a commit that referenced this issue Oct 16, 2024
susanodd pushed a commit that referenced this issue Oct 16, 2024
@susanodd
Copy link
Collaborator

susanodd commented Oct 17, 2024

@rem0g suggested (in #1331) we redesign the video upload. I think the current workflow is:

  • Name video files after glosses, store them in a zip
  • Upload a zip file to a server somewhere and grab its url
  • Call /dictionary/upload_zipped_videos_folder_json/{datasetid}/ and provide this url
  • Call /dictionary/upload_videos_to_glosses/{datasetid} to actually link them to the glosses.

I want to replace this with just one step, to be run for each gloss invidually:

  • POST /dictionary/gloss/{gloss_id}/zippedvideo with just the zipped file in the payload.

I don't think uploading the videos one by one has any real drawbacks in terms of overload, and it does make the whole process easier to follow: you upload a video to a gloss, and if it fails somewhere in the process, it can just return what went wrong for that one gloss.

Also, @rem0g suggests to remove the h264 check:

it's not needed to check whether it's h264. If it doesnt play, it wont play and we will know that one way or another.`

@Woseseltops the zip file contains an archive of many videos. Not just one video.
It was intended to reduce traffic.
For example, if you upload a zip file with 20 videos, it only needs two url requests, not 40.

I am working on making the zip archive inside the POST instead of as a URL.
The original method gets the zip file from a URL because at the time, I had no idea how to put a zip file inside the POST.
Tim helped me do it as a URL because you were on vacation then.

@Woseseltops
Copy link
Collaborator Author

Response to @rem0g :

Yes perfect

I'm not sure if you're responding to my Python example or my (alternative) proposal below it, but to make sure I don't slow things down even further I have finished the end point as in the example and put it up for review. Let me know if you're also interested in having a generic endpoint that doesn't force you to put the gloss ID in the URL!

will there also be callback if the video upload has succeed and processed in the database directly from the URL?

This is example output from using the script above, does that answer your question?

400
{"error": "No file uploaded"}
200
{"message": "Uploaded video of size 156919 bytes to dataset NGT."}

If that will cause a long wait time I can adjust my scripts.

In my test cases it is relatively quick! There are worries on our side that adding this end point might generate a lot of parallel requests, which will lock our database (which was not designed for mass usage). We could implement rate limiting, but instead it's probably easier to ask you to put a few seconds between each request if possible :)

@rem0g
Copy link
Collaborator

rem0g commented Oct 17, 2024 via email

@Woseseltops
Copy link
Collaborator Author

Okay I think the only remaining item here is this idea

How about an endpoint GET dictionary/{gloss_id}/video_history, which returns a list of upload/delete dates for videos and ids to refer to them, and POST dictionary/{gloss_id}/restore_video with an id to restore a video uploaded earlier?

I'm not sure what we have in terms of video history at all, so perhaps before I start: @rem0g , given approaching deadlines, do you still think it would make sense for me to work on this?

@susanodd
Copy link
Collaborator

  • You can browse the history in the admin to see whether it is useful.
  • When there were repeated attempts in the past because the video upload wasn't working , there were 74 entries for one of the DISNEY glosses.
  • Some filtering would be needed

@susanodd
Copy link
Collaborator

susanodd commented Nov 22, 2024

@Woseseltops

The video upload now makes use of the add_video command that makes use of reversion. This can go wrong. See #1374 and the other discussions trying to repair it.

It may be needed to convert a video if necessary to mp4 prior to performing gloss method add_video, and fail there if it cannot be converted. There was something going on with frame rates when it was failing.

If you look in the admin, there are 8 glosses that do not have video files for (NGT) GlossVideo objects.
It looks like they may have had a wrong format, or some other reason they are missing.

Some other videos with format "webm" are showing up in the Manage Storage page. That format is not visible/supported on Ubuntu or MacOS. The files are not being converted anymore because UvA did not want that for the API.

Gloss Video Files with non-mp4 Extensions

3 results found

Lemma (Nederlands) | Glos-ID | Annotatie (Nederlands) | Annotatie (Engels) | Video Paths
1-A | 2442 | 1-A | 1-A | 0: glossvideo/NGT/1-/1-A-2442.webm
1-B | 2896 | 1-B | 1-B | 0: glossvideo/NGT/1-/1-B-2896.webm
PT | 43805 | PT-Ahand | PT-Ahand | 0: glossvideo/NGT/PT/PT-43805.webm

Files that don't exist for the GlossVideo objects

ID 	 GLOSS 	 VIDEO FILE 	PERSPECTIVE 	NME 	FILE TIMESTAMP 	FILE GROUP 	PERMISSIONS 	FILE SIZE 	 VERSION
28330	OPEN-D	/var/www/writable/glossvideo/NGT/OP/OPEN-D-49062.mp4	False	False					0
28270	HALLOWEEN-A	/var/www/writable/glossvideo/NGT/HA/HALLOWEEN-A-48717.mp4	False	False					0
28253	IPAD-B	/var/www/writable/glossvideo/NGT/IP/IPAD-B-48656.mp4	False	False					0
28247	IPAD-A	/var/www/writable/glossvideo/NGT/IP/IPAD-A-48563.mp4.bak28247	False	False					1
28238	IPAD-C	/var/www/writable/glossvideo/NGT/IP/IPAD-C-48657.mp4	False	False					0
28193	IN-JE-TE-HEBBEN	/var/www/writable/glossvideo/NGT/IN/IN-JE-TE-HEBBEN-48108.mp4	False	False					0
22666	DOWNLOADEN-D	/var/www/writable/glossvideo/NGT/DO/DOWNLOADEN-D-48439.mp4	False	False					0
18917	AANDOEN	/var/www/writable/glossvideo/NGT/AA/AANDOEN-46565.mp4	False	False					0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants