
Discuss BEP021 derivatives for electrophys #5

Open

robertoostenveld opened this issue Feb 15, 2021 · 67 comments

Comments

@robertoostenveld
Collaborator

This continues the issue started here bids-standard/bids-specification#733. This BEP021 is a better place to continue the discussion, as we can also use projects and share files here which would not fit under bids-specification directly.

@robertoostenveld
Collaborator Author

On 12 February 2021 we had a Zoom call to discuss the progress on BEP021, for which the draft is on http://bids.neuroimaging.io/bep021 on Google Docs. Some of the people who attended were @arnodelorme, @dorahermes, @guiomar, @jasmainak, @agramfort, but I do not know everyone's GitHub handle. Please help by mentioning the other attendees with a GitHub presence here.

@robertoostenveld
Collaborator Author

@jasmainak mentioned

The example dataset:

bids-standard/bids-examples#171
bids-standard/bids-examples#161

I haven't looked too closely but I see that it was merged by Robert :) Perhaps a dataset to use for discussion?

@robertoostenveld
Collaborator Author

The pull requests bids-standard/bids-examples#171 and bids-standard/bids-examples#161 refer to “derivatives”, but looking at the dataset at https://github.com/bids-standard/bids-examples/tree/master/eeg_face13 there is nothing that sets it apart from a regular raw BIDS EEG dataset.

An example dataset that is a proper derivative is https://github.com/bids-standard/bids-examples/tree/master/ds000001-fmriprep. Note, however, that there are quite a few files in that derived dataset that have to be ignored, as they are not standardized (yet).

Just some general pointers that relate to some topics we discussed:

On https://bids-specification.readthedocs.io/en/stable/03-modality-agnostic-files.html#derived-dataset-and-pipeline-description it is specified that a derived dataset is also a BIDS dataset. The derived dataset has some required and some recommended extra elements in the dataset_description. When you prepare an example/draft derived ieeg/eeg/meg dataset, please keep these in mind.
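To make this concrete, a minimal dataset_description.json for a derived EEG dataset could look roughly like the sketch below; the field names follow the linked section of the spec, but all values are made up for illustration:

{
  "Name": "Example EEG preprocessing derivative",
  "BIDSVersion": "1.6.0",
  "DatasetType": "derivative",
  "GeneratedBy": [
    {
      "Name": "filtering",
      "Version": "0.1.0",
      "Description": "High-pass filtering of the raw EEG"
    }
  ],
  "SourceDatasets": [
    {
      "DOI": "10.18112/openneuro.ds000000.v1.0.0"
    }
  ]
}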

Also, on https://bids-specification.readthedocs.io/en/stable/05-derivatives/01-introduction.html it states "Derivatives are outputs of common processing pipelines, capturing data and meta-data sufficient for a researcher to understand and (critically) reuse those outputs in subsequent processing.” That lines up with what Scott and I were saying that a (derived) dataset should be (re)usable in its own right.

@robertoostenveld
Collaborator Author

ping @adam2392 @hoechenberger

@robertoostenveld
Collaborator Author

@guiomar wrote:

Hi! I copy here the email too :)

Thanks everyone for the meeting today!

We thought that for the next meeting some of us could present some practical examples so we can have a starting point to discuss.
Preferably we could share the datasets in advance so that everyone who has time can go over them and focus the discussion on the more problematic things.

Here's the link to the common derivatives specs merged in 1.4.0:
https://bids-specification.readthedocs.io/en/stable/05-derivatives/01-introduction.html
There, "derivative" is defined as:

Derivatives are outputs of common processing pipelines, capturing data and meta-data sufficient for a researcher to understand and (critically) reuse those outputs in subsequent processing.

In line with what we have discussed.

Also note that there are some metadata fields to point to source data:
https://bids-specification.readthedocs.io/en/stable/05-derivatives/02-common-data-types.html

It's important to distinguish between two types of derivatives:

pre-processed (in essence similar to the source data - the datatype remains unchanged)
and processed (substantially different from the source data - the datatype has changed)
So I think pre-processed derivatives can be easily addressed at this point (e.g. detrending, filtering, downsampling, re-referencing, etc).

Following this logic, I don't know exactly where annotations would fall. But they are also an important step (and I think they can also be interesting for other modalities and extensions).

Talk very soon!

@robertoostenveld
Collaborator Author

and @sappelhoff wrote that

As far as I heard, @smakeig (Scott Makeig) @nucleuscub (Ramon Martinez-Cancino) also attended

@arnodelorme
Collaborator

Thank you, Robert, for this summary. We will generate test datasets as discussed and then we should reconvene.

@guiomar

guiomar commented Dec 1, 2021

Hi @sappelhoff @robertoostenveld @arnodelorme, @dorahermes, @jasmainak, @agramfort @hoechenberger @smakeig @nucleuscub @CPernet !

I want to revive the effort on ephys derivatives.

Having a new look at the document, I see there are 3 main blocks:

  1. Annotations
  2. Preprocessing (derivatives that don't change the datatype)
  3. Processing (new datatype)

I would be inclined to divide the work into independent lines, to avoid getting stuck due to the large amount of work ahead.
And I would start by dealing with: 2) pre-processing.
We have almost all the details needed to tackle this one. I'm preparing some examples.

Would you like to meet and move this part forward?
https://www.when2meet.com/?13797922-eLdhm

@hoechenberger

hoechenberger commented Dec 1, 2021

And I would start by dealing with: 2) pre-processing.

Preprocessing sounds good to me, though I'm not sure about the point in the parentheses ("derivatives that don't change the datatype"). To me, preprocessing also includes epoching the data, which will create new data types too. But I'm also happy to limit the next discussion to continuous data only (Maxwell filtering, frequency filtering, …).

I will share some datasets we process using our BIDS pipeline shortly so you all can see how we're currently approaching things.

Cheers,
Richard

cc @agramfort

@guiomar

guiomar commented Dec 1, 2021

Thanks @hoechenberger !!

I think this definition comes from the generic derivatives specification, but I'm not able to find it anymore.
Still, it makes sense to differentiate between the two for this first approach, until we agree on the data format to be used when the datatype of the raw source is changed, to avoid getting blocked by this decision.
The annotations part is also more prone to debate. So let's focus on the more objective and straightforward parts!

This sounds awesome Richard! Thanks!

@adam2392
Member

adam2392 commented Dec 1, 2021

I have read over the BEP021 derivatives and also added my availability, although I'll be in California at the time, so it could be hard to overlap.

I also utilized the "annotations" derivatives framework to create a dataset of event markings found in the iEEG via an automated algorithm. Specifically, these are "high-frequency oscillation" (HFO) onset/duration/channels stored as a tsv file. It works well for my use case, but the tsv file does "explode" in length. The dataset is in the Dropbox link below.

https://www.dropbox.com/sh/5ih5ay9fvo3q12s/AADBY5eDc_SmszHGyC3Mn6QJa?dl=0

@jasmainak
Collaborator

I'm hammered that week. Unlikely to be able to join :(

I agree about the "explosion" issue that @adam2392 pointed out. I think I've seen it before. It might help to limit the vocabulary of annotations so that they are machine-readable, not just machine-writeable.

@guiomar

guiomar commented Dec 2, 2021

Hi @adam2392! Awesome! We can review the annotations with your example as well :)

@guiomar

guiomar commented Dec 6, 2021

It seems the most preferred day to meet is Wed 15th Dec from 5 to 6pm CET.
I'll send you calendar invites with hangout link :)

@hoechenberger

Thank you, @guiomar!

@guiomar

guiomar commented Dec 6, 2021

For those who didn't receive the invitation and still want to join, this is the link to Hangouts:

bep021 - ephys derivatives
Wednesday, 15 December · 17:00 – 18:00
Google Meet joining info
Video call link: https://meet.google.com/ccc-uzuc-nyg
Or dial: ‪(US) +1 716-249-4224‬ PIN: ‪807 934 263‬#

@guiomar

guiomar commented Dec 16, 2021

Thank you all who joined yesterday!
@hoechenberger @robertoostenveld @adam2392 @smakeig @tpatpa

I would like to summarize here some of the main points we discussed in the meeting:

1) Annotations:

2) Preprocessing steps:

I have dedicated some time to reorganize the BEP021 documentation accordingly, since it was becoming very messy:
https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/

Please add any other points I may have forgotten that you consider important :)

We planned to do another meeting in January to show some examples and continue further discussing the remaining issues.
If you are interested, you can mark your availabilities here:
https://www.when2meet.com/?13923648-HOHfD

Talk soon!

@smakeig

smakeig commented Dec 18, 2021 via email

@guiomar

guiomar commented Dec 20, 2021

Thanks @smakeig! I can't see any attachment; maybe it's easier if you share the links?

@smakeig

smakeig commented Dec 20, 2021 via email

@guiomar

guiomar commented Jan 10, 2022

Hello @sappelhoff @robertoostenveld @arnodelorme, @dorahermes, @jasmainak, @agramfort @hoechenberger @smakeig @nucleuscub @CPernet @adam2392 @tpatpa!

We planned to do another meeting these days, if you are interested, you can mark your availabilities here:
https://www.when2meet.com/?13923648-HOHfD

@guiomar

guiomar commented Jan 11, 2022

Thanks a lot! Let's make it Tuesday 25 January at 1:00 pm CET.
I'll send an invitation shortly.

@guiomar

guiomar commented Jan 11, 2022

Details for joining:

bep021 - ephys derivatives
Tuesday, 25 January · 13:00 – 15:00
Google Meet joining info
Video call link: https://meet.google.com/ccc-uzuc-nyg
Or dial: ‪(US) +1 716-249-4224‬ PIN: ‪807 934 263‬#

@robertoostenveld
Collaborator Author

Let me repost the content of an email here that I already sent to the invitees of the most recent BEP021 meeting. I'll reformat it slightly. Having the discussion here rather than via email keeps it better in the open for everyone to follow and contribute.

Following the discussion of 25 Jan, in which we updated the BEP021 google doc to indicate what is in and out of scope, I have worked on some example pipelines and derivatives corresponding to sections 6.2, 6.3, 6.4, 6.5 and 6.6 in the google doc.

I started with doi:10.18112/openneuro.ds003645.v1.0.0, downloaded it (partially) and wrote a script to make a selection. I do not consider this part of the pipeline yet (although it could have been), so my starting point is “ds003645_selection” (selection code included).

Starting from “ds003645_selection", I ran the following pipelines

  • filtering
  • downsampling (without explicit filtering, i.e. starting from ds003645_selection)
  • rereferencing
  • filtering, followed by downsampling, followed by rereferencing (as three pipelines)
  • filtering and downsampling and rereferencing (as one preproc pipeline)

This results in 6 derivatives (indicated in bold, also below). The code for each is in the respective “code” directory. Note that there are more lines of code needed for data handling than for the actual pipeline (which I implemented with FieldTrip).

The resulting directory tree (only showing directories, not the files) looks like this (see below) and can be browsed on google drive (no login needed, view only).

ds003645_selection
├── code
├── derivatives
│   ├── downsampling
│   │   ├── code
│   │   ├── sub-002
│   │   │   └── eeg
│   │   ├── sub-003
│   │   │   └── eeg
│   │   └── sub-004
│   │       └── eeg
│   ├── filter_and_downsample_and_rereference
│   │   ├── code
│   │   ├── sub-002
│   │   │   └── eeg
│   │   ├── sub-003
│   │   │   └── eeg
│   │   └── sub-004
│   │       └── eeg
│   ├── filtering
│   │   ├── code
│   │   ├── derivatives
│   │   │   └── downsampling
│   │   │       ├── code
│   │   │       ├── derivatives
│   │   │       │   └── rereference
│   │   │       │       ├── code
│   │   │       │       ├── sub-002
│   │   │       │       │   └── eeg
│   │   │       │       ├── sub-003
│   │   │       │       │   └── eeg
│   │   │       │       └── sub-004
│   │   │       │           └── eeg
│   │   │       ├── sub-002
│   │   │       │   └── eeg
│   │   │       ├── sub-003
│   │   │       │   └── eeg
│   │   │       └── sub-004
│   │   │           └── eeg
│   │   ├── sub-002
│   │   │   └── eeg
│   │   ├── sub-003
│   │   │   └── eeg
│   │   └── sub-004
│   │       └── eeg
│   └── rereference
│       ├── code
│       ├── sub-002
│       │   └── eeg
│       ├── sub-003
│       │   └── eeg
│       └── sub-004
│           └── eeg
├── sub-002
│   └── eeg
├── sub-003
│   └── eeg
└── sub-004
    └── eeg 

As discussed yesterday, your review of these examples can be used to get concrete ideas about what needs to be done to extend the specification. Note that, AFAIK, I have now created derivative datasets that are compliant with BIDS version 1.6.0. I did not run the validator (as it does not yet work on derivatives).

Some known issues at the moment I created the derivatives:

  • README and CHANGES are not updated
  • Some of the metadata fields have suboptimal values (e.g. because the derived datasets are only on my disk)
  • the eeg.json metadata is very minimal

@robertoostenveld
Collaborator Author

@arnodelorme wrote in a reply to my email

I feel for reproducibility purposes, we should indicate the version of MATLAB (and Fieldtrip) in the script. Also, the script could be named with a standard name (like derivative.m) to avoid confusion if there are several scripts in the folder.

For the script:

  • Add the DOI of the parent BIDS datasets (as a comment). There can be several snapshots of a dataset, so the name alone is not enough. NOTE: THIS MIGHT BE A MAJOR FLAW OF THE CURRENT DERIVATIVE FORMAT, AS THE “SourceDatasets” field of “dataset_description.json” does not allow the dataset to be uniquely identified.
  • Add the version of MATLAB as a comment
  • Add the version of Fieldtrip as a comment

This could be an optional recommendation for people, so we maximize reproducibility. Or we could imagine a specific JSON file for that in the code folder (derivative.json).

I would not cascade BIDS derivative repositories if there is no intention of assigning a DOI to each one. In that case (e.g. downsampling plus filtering), create a single dataset that is both downsampled and filtered. This could avoid the proliferation of datasets and confusion.

Alternatively, we could assign a DOI to a cascaded BIDS dataset (so parent folder plus derivative folder plus the derivative of the derivative folder). However, adding a derivative folder means that we have to create a new DOI for the whole cascaded BIDS dataset, which does not seem like a good idea.

Also, in the derivative specification we should consider folders named derivative, derivative2, derivative3, etc.

Note that the proposed architecture is the opposite of the behavior of the cfg variable in FieldTrip, where the parent's information is stored in the substructure cfg.cfg (unless I am mistaken). In BIDS, parent BIDS datasets are stored in the parent folder. I personally prefer it when parents are stored in a subfolder (more intuitive). Also, this would avoid the problem of having multiple “derivative” folders (a dataset usually has a single parent, but a parent can have multiple children). That's a broader discussion, of course.

@robertoostenveld
Collaborator Author

robertoostenveld commented Jan 28, 2022

Looking at the nested derivatives data structure that I created, I realize two things. These are more fundamental than the discussion on which specific metadata fields are needed (like the MATLAB and FieldTrip versions, with which I agree).

  1. Do we want to store provenance (filter settings etc.) along with each individual derivative data file (i.e. duplicated), or do we want to store it along with the information about the pipeline? Note that if we were to store it with each data file in the eeg.json, we could make use of inheritance and deduplicate it at the top level.
  2. Do we want to store the provenance of the previous steps along with every derivative? This touches upon the data.cfg.previous.previous.previous strategy used in FieldTrip.

Regarding 1: it makes sense (and is needed) if each file were processed differently. But along the processing pipeline we usually make raw data, which might be inconsistent at the individual participant level, more and more consistent. If the same pipeline is used on all data files, then documenting the pipeline metadata once should suffice. Note that documenting it along with the data is like data.cfg.previous.previous.previous, whereas documenting it with (or in) the code is more similar to having the script available (e.g. on GitHub, in the code directory, or as documented in the GeneratedBy metadata field).
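To sketch the inheritance variant of option 1: a single top-level sidecar (for example a task-FacePerception_eeg.json at the root of the derived dataset) could document the filter settings once for all data files. The SoftwareFilters field exists in the EEG sidecar specification; the values below are only illustrative:

{
  "TaskName": "FacePerception",
  "SoftwareFilters": {
    "Band-pass filter": {
      "Highpass cutoff (Hz)": 0.1,
      "Lowpass cutoff (Hz)": 40
    }
  }
}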

Regarding 2: Assuming that we would only store the provenance of the last step, then in my example the metadata of ds003645_selection/derivatives/rereference could not be distinguished from the metadata of ds003645_selection/derivatives/filtering/derivatives/downsampling/derivatives/rereference, as both have the same last step. Although @arnodelorme mentions that my sequential cascade of pipelines is not optimal w.r.t. data handling (and I agree), the principle is that pipelines can be cascaded. My cascade should therefore be considered an example that should be supported if a researcher wants to do it that way.

My item 2 also relates to what Arno mentions w.r.t. pointing from the derivative to its source. He discusses it in relation to DOIs and hence published/persistent datasets, whereas when creating the examples I was not thinking about publishing, but rather about referencing the local (on my hard disk or my lab's network disk) relation between datasets. This also relates to PRs bids-standard/bids-specification#820 and bids-standard/bids-specification#821.

I think we all have some implicit expectation about how people (including ourselves) work and when to start a new derivative, or when to "weave" additional files into an existing derivatives. In general my directory organization looks like this

projectId/raw       # might be a symbolic link to data on another volume
projectId/code      # ideally also on github for version control and collaboration 
projectId/results

and as long as I keep on working on the pipelines in the code directory, the intermediate and final files continue to go into the results directory (which has some internal directory structure similar to BIDS). Once the analysis is finalized, I would clean up the code and results (prune dead branches, rename files for improved consistency, possibly rerun it all once more to ensure consistency and reproducibility) and consider publishing the results+code directories together. Those would then comprise the BIDS derivative. An example is this with raw and derivatives data, plus a copy of the code on github (a static version of the code is also included in the derivative).

The example that I prepared is however at a larger collaborative (and more FAIR) scale, where Daniel and Rik prepared the initial multimodal data ds000117, which was then handled by Dung and Arno, resulting in ds003645, which was handled by me resulting in ds003645_selection, which was then handled by the "filtering guy" resulting in derivatives/filtering, which was then handled by the "downsampling guy", etc, etc. That leads me to the question: which information do we want to be present at which level (e.g. is the EEG amplifier brand still relevant after 5 stages of processing?), and which information do we expect re-users of a dataset to look up in the ancestor dataset (which then needs to be accessible)?

@agramfort

agramfort commented Feb 2, 2022

I have not looked in detail, but browsing the Google Drive I see that the entity desc is used, while I would have imagined proc to be more natural. It's not a strong feeling though. The derivatives > derivatives > nested structure is coherent but maybe hard to navigate. If you have 6 steps of processing (maxfilter, temporal filtering, artifact rejection, epochs, averaging ...) that is a lot, and I would have considered putting all the produced files in the same derivatives folder for a given subject. With the proposal here, is what I describe a valid option? thx @robertoostenveld for the coordination

@robertoostenveld
Collaborator Author

robertoostenveld commented Feb 2, 2022

From the spec (emphasis mine): "The proc label is analogous to rec for MR and denotes a variant of a file that was a result of particular processing performed on the device. This is useful for files produced in particular by Elekta’s MaxFilter (for example, sss, tsss, trans, quat or mc), which some installations impose to be run on raw data because of active shielding software corrections before the MEG data can actually be exploited."

The existing entities res/den/label/desc are specifically applicable to derivative data, and desc is used "[w]hen necessary to distinguish two files that do not otherwise have a distinguishing entity...", which is why I used it with the short description of the pipeline.

My example pipelines only produced a single result per raw input file; the sequential application served to have us think about what happens if we pass derivatives around between each other or on openneuro. If you have a pipeline that produces multiple results (which also makes sense), then those can be placed next to each other. I imagine that could result in

sub-002_task-FacePerception_run-1_desc-pipelinenameresult1_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinenameresult2_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinenameresult3_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinenameresult4_eeg.eeg + vhdr + vmrk

where the description (hence desc) pipelinenameresultN combines both the pipeline name and the specific result. However, you could also imagine

sub-002_task-FacePerception_run-1_desc-pipelinename_xxx-result1_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinename_xxx-result2_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinename_xxx-result3_eeg.eeg + vhdr + vmrk
sub-002_task-FacePerception_run-1_desc-pipelinename_xxx-result4_eeg.eeg + vhdr + vmrk

where xxx codes another to-be-defined entity to split the pipeline from the result.

I don't think that we will benefit from long file names with specific entities for specific pipelines (such as filt-lp and filt-hp for different types of filters, or ref-avg and ref-common for different rereferencing schemes). The entity list is already very long and if we want to have a unique entity for each step, combined with prescribed values for the label, the specification would have to become very elaborate. I rather think that we would benefit from a simple desc-pipelinenameresultN where we would specify a very clear/explicit mapping of pipelinenameresultN to a metadata field inside the corresponding JSON. A bit similar to how we have the required json metadata field TaskName for functional data (which maps onto task-<TaskName> in the filename), combined with the recommended TaskDescription metadata field. It could then be DescName and DescDescription (which sounds silly) or so. However, I think the existing GeneratedBy also serves the purpose just fine.

An important difference between GeneratedBy and extending the filename of individual files (and/or the JSON metadata of individual files) is whether we document the pipeline once in dataset_description.json, or many times for each resulting file.
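For the "document it once" option, the GeneratedBy entry in dataset_description.json could carry the pipeline description, for example (names, version and URL below are hypothetical):

{
  "GeneratedBy": [
    {
      "Name": "filtering",
      "Version": "1.0.0",
      "Description": "High-pass filter at 0.1 Hz applied to all EEG recordings",
      "CodeURL": "https://github.com/example/filtering-pipeline"
    }
  ]
}

The per-file alternative would instead repeat essentially the same information in a sidecar JSON next to every output file.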

@robertoostenveld
Collaborator Author

Yesterday we had a BIDS steering group meeting (@guiomar also attended) and it was mentioned there that BEP028 is making good progress. I will study that; you might want to look at it as well.

Furthermore, we also briefly touched upon the two tangential motivations for BIDS: making the results of an analysis replicable (e.g. being able to recompute them) and making raw or derived data reusable (for follow-up analyses). The first requires extensive details; the second can also be achieved with minimal metadata.

Also (as I was reminded in the meeting yesterday), the overarching BIDS strategy is to keep things as simple and small as possible, and to consider the 80/20 Pareto principle.

@SaraStephenson

SaraStephenson commented Mar 9, 2022

Hello All,

With the help of @CPernet and @jesscall, @jadesjardins and I have prepared an example of the Face13 dataset with annotations stored in .tsv files and described in .json files. This current example is for discussion surrounding how to store continuous time annotations.

These files are located within the bids-examples/eeg_face13/derivatives/BIDS-Lossless-EEG/sub-*/eeg folders.

The annotations in this example were produced by the EEG-IP-L pipeline. There are several different types of annotations from this pipeline, including channel annotations, component annotations, binary time annotations and non-binary (continuous) time annotations.

The EEG-IP-L pipeline currently produces an annotations.json, annotations.tsv, and annotations.mat file. The .json describes all of the pipeline annotations. The .tsv contains the channel, component and binary time annotations. The .mat file contains the continuous time annotations. Since the .mat file is not a part of the BIDS specification, this current example has added a ‘recording-marks_annotation.tsv.gz’ and an accompanying 'recording-marks_annotation.json' for continuous time annotations. The 'recording-marks_annotation.tsv.gz' and the .json file were created based on the BIDS spec for storing physiological and other continuous recordings.
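For reference, a continuous-recording sidecar in that style needs at least SamplingFrequency, StartTime and Columns (as in the spec for physiological and other continuous recordings); a minimal sketch with made-up values could be:

{
  "SamplingFrequency": 512,
  "StartTime": 0,
  "Columns": ["amica_loglikelihood"]
}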

If we are to store continuous time annotations in a tsv file, one concern we have is the need for two annotations tsv files because the non-binary time annotations are stored differently than the binary time annotations, component annotations, and channel annotations. As all of these annotation types are important for the EEG-IP-L pipeline, we are looking forward to some suggestions around how they can best be stored in BIDS.

Thanks,
Sara Stephenson

@robertoostenveld
Collaborator Author

Thanks @SaraStephenson!

To help others that want to look at it on their own computer: I just did this to get the changes (which are on a branch that is 160 commits behind and 1 commit ahead of HEAD)

cd bids-examples
git checkout master
git pull origin master # where origin points to git@github.com:bids-standard/bids-examples.git
git checkout -b BUCANL-bep021_ephys_derivatives
git checkout 2105c6d23eddd657e2305fbb5181242c1b1b1545 # go back in time, to avoid conflicts elsewhere
git pull --ff-only git@github.com:BUCANL/bids-examples.git bep021_ephys_derivatives

@CPernet
Collaborator

CPernet commented Mar 10, 2022

YES Robert! thx -- I'll have a look as well.
Now of course raw or derivatives - even if pushed in BEP021, we went for raw here.

@robertoostenveld
Collaborator Author

@SaraStephenson Let me comment on what I encounter while going through the data.

first in derivatives/BIDS-Lossless-EEG

dataset_description.json

  • even though it is now nested in a dataset, I would prefer SourceDatasets still to be specified with a DOI.
  • I cannot find version 0.1.0 Alpha; it would be better to either tag that and/or make a GitHub release, or use a git SHA as the version

README and README.md are duplicates.

The LICENSE file applies to the code, but seems inappropriate for the data, i.e. it is not a data use agreement. The source eeg_face13 is ODbL, which also applies to derivatives (since it is share-alike). I recommend adding the license not only as a file, but also to dataset_description.json. Perhaps you want to move the LICENSE file to the code directory.

Rather than linking to https://jov.arvojournals.org/article.aspx?articleid=2121634 I recommend to link to https://doi.org/10.1167/13.5.22

The file eeg_face13/task-faceFO_events.json can be an empty object {} but not an empty list []. Better would be if it were to explain a bit about the events.tsv files.

The electrode files are identical for all subjects; that suggests that they are not measured but a template. It is not recommended to add template data to the individual subjects. If you want to add a single template for all subjects, it is better to put it at the top level (i.e. following the inheritance principle).

The IntendedFor path is inconsistent between sub-001_task-faceFO_annotations.json and sub-001_task-faceFO_desc-qc_annotations.json (one has ./ in front, the other not).

It is not clear to me what the difference is between sub-001_task-faceFO_annotations.tsv and sub-001_task-faceFO_desc-qc_annotations.tsv.

I don't think that the SamplingFrequency field in sub-002_task-faceFO_annotations.json is needed. The corresponding TSV file is not expressed in samples, but in seconds.

I don't think that EDF is the optimal binary format for processed EEG data. EDF is limited to 16 bits, whereas the data was recorded with 24 bits (since it is Biosemi) and subsequently processed as single or even double precision. I recommend writing to the BrainVision format, which allows single-precision floats to be represented. Or to EEGLAB .set.

The two kinds of things you coded in the annotations.tsv files appear to me to be nearly orthogonal and unrelated to each other: the first few rows (with chan_ and comp_) don't relate to onset and duration, and the later rows don't relate to channels. Each row has a label, but the chan_xxx and comp_xxx labels appear to be very different from all the others. Would it not be better to have those in two TSV files? Or possibly even three: a desc-chan_annotations.tsv, desc-comp_annotations.tsv and desc-task_annotations.tsv file?

I am not sure (cannot check, since zero bytes) what is in the mat files.

There is a sub-001_task-faceFO_recording-marks_annotation.json file with a data dictionary, but no corresponding data. I would expect that to come with a TSV file (even when it would be empty, i.e. only with the first header row).

Not related to the derivative, but I noticed a typo in eeg_face13/sub-003/eeg/sub-003_task-faceFO_eeg.json: McMaster Univertisy rather than University.

Moving on to BIDS-Seg-Face13-EEGLAB:

Again duplicate README files.

Again, the PipelineDescription.Version cannot be found on GitHub. Also here the dataset_description could contain more info. There is no license (it should be ODbL, since derivatives of the original data should be share-alike).

I cannot review anything else at this level any more (since there are only binary files), which is not a problem per se.

@arnodelorme
Collaborator

arnodelorme commented Mar 10, 2022 via email

@SaraStephenson

Thank you for all your comments about the Face13 example @robertoostenveld, I will look into making the appropriate corrections. I want to provide some clarifying information about annotations in the Face13 example dataset so that the discussion about how to best store the different types of annotations (component, channel, binary time and continuous time annotations) in BIDS can continue.

The sub-001_task-faceFO_annotations.tsv file contains the annotations that were produced by the EEG-IP-L pipeline. The sub-001_task-faceFO_desc-qc_annotations.tsv file contains annotations after the manual quality control (QC) procedure has been completed. During the QC procedure, the reviewer can modify some time and component annotations (particularly the ‘manual’ mark) based on visual inspection of the data.

The formatting of our current annotations.tsv files (that contains component, channel, and binary time annotations in one file) are based on a combination of Examples 2 and 3 in Section 5.1: Sidecar TSV Document in the BEP 021 google doc.

I have a few concerns about storing the chan, comp, and time annotations in separate files. One concern is that this will result in a large number of annotation files considering there would also be multiple versions of each file (one for the EEG-IP-L pipeline output and at least one for the QC’ed (desc-qc) data). Another concern is the naming of these annotation files. Currently we use desc-qc to indicate if the annotations are associated with QC’ed data, but would this complicate naming the different types of annotation files with desc-chan, desc-comp and desc-task?

The .mat file contains the continuous time annotations (such as the AMICA log likelihood). Since the .mat file is not a part of the BIDS specification, this current example has added a recording-marks_annotation.tsv.gz and an accompanying recording-marks_annotation.json for continuous time annotations. The recording-marks_annotation.tsv.gz and the .json file were created based on the BIDS spec for storing physiological and other continuous recordings. The recording-marks_annotation.tsv.gz in the Face13 example contains 100 rows for each of the annotations listed in the recording-marks_annotation.json. These new files were created because the continuous time annotations can not be stored in the same way the component, channel, and binary time annotations are currently stored.

Hopefully this example can help move the discussion on how to store annotations (particularly continuous time annotations) in BIDS forward.

Thanks,
Sara

@smakeig

smakeig commented Mar 24, 2022 via email

@robertoostenveld
Collaborator Author

I'd suggest treating the AMICA likelihood index as a derived data channel time sync'ed with the original data channels

Continuous data that is time synched with other data is already part of the BIDS specification and documented here https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/06-physiological-and-other-continuous-recordings.html. A very similar approach (again with TSV files) is used for PET blood recording data.

@arnodelorme
Collaborator

arnodelorme commented Mar 25, 2022 via email

@jesscall

Hi BEP021 community, I'm looking to move forward on a few points from @SaraStephenson 's thread above:

The formatting of our current annotations.tsv files (that contains component, channel, and binary time annotations in one file) are based on a combination of Examples 2 and 3 in Section 5.1: Sidecar TSV Document in the BEP 021 google doc.

[...]
To store them in separate files, a few questions/concerns pop up:
a. This creates a large number of annotation files - multiplied by pipeline (EEG-IP-L) and QC outputs (desc-qc).
b. Naming all these files - Currently we find it Very useful to use desc-qc to indicate the annotations are associated with QC'ed data, but would this be possible with the convention of desc-chan, desc-comp and desc-task?

@robertoostenveld What are your thoughts on this? Should this be revisited?

We'd like to avoid unnecessary complexities in file naming, and Sara's example follows examples 2 and 3 of BEP021 5.1: Sidecar TSV.

...

Second,

I wonder if it would be productive to call what you refer to as 'continuous time annotations' as, rather, 'continuous time data measures' -

@smakeig thank you -- calling them "measures" rather than annotations can address this nicely, and works with @robertoostenveld and @CPernet's prior comments on storing in TSV files as continuous recordings. see spec here in previous comments.

@SaraStephenson, I think we'll try moving away from "annotations" - perhaps recording-marks_amica.tsv.gz could work?

@dorahermes
Member

I agree with differentiating continuous time measurements versus event based annotations with an onset and duration.

@tpatpa and I are making some updates to the examples for event based annotations with an onset and duration that use HED/SCORE tags.

@smakeig

smakeig commented May 10, 2022 via email

@jesscall

Hi all!

@robertoostenveld @CPernet @arnodelorme
I'm hoping to move the discussion forward for EEGNet for the first point in this comment. We'd appreciate your thoughts on avoiding file splitting to this extent.

Thanks!

CC @SaraStephenson @christinerogers

PS: @smakeig we're happy to workshop the naming of those TSV files - you're right that marks still suggests an individual time marker.

@arnodelorme
Collaborator

It is unclear to me how the annotation file differs from the events.tsv file.

@CPernet
Collaborator

CPernet commented Jun 2, 2022

Have you tried @robertoostenveld proposal for the continuous measurement/annotation?

@jesscall

jesscall commented Jun 2, 2022

Have you tried @robertoostenveld proposal for the continuous measurement/annotation?

@CPernet
Yes, @SaraStephenson has already committed the suggested changes here.

Can we go ahead and make the necessary adjustments to the BEP021 spec to reflect this suggestion? This paragraph in section 5.1 of the doc still proposes the alternative method of creating a synthetic data channel. Perhaps it can be updated with information on storing continuous recordings as described here?

@robertoostenveld
Collaborator Author

It is unclear to me how the annotation file differs from the events.tsv file.

I don't think they technically differ (except for the name), but conceptually I see it like this

  • events is for the observable things that happened during acquisition independent of the EEG data (like stimuli, button presses) and that could also have happened and be observed if the EEG were not recorded at all
  • annotations is for things that are observed specifically in the EEG data and hence requires (re)viewing the data

For many things this works, although I realize that there are some situations where it might not be clear. But as long as it covers 80% of the use-cases, I think it is useful. No need to get 99.99% coverage.

@jesscall

jesscall commented Jun 8, 2022

Hi all,

I made a suggestion in the BEP021 document in order to reflect the discussion in this thread (RE: continuous annotations / data measures). The suggestion is based on feedback from @CPernet and @robertoostenveld.

Please have a look and let me know if further amendments are needed.

@smakeig

smakeig commented Jun 8, 2022 via email

@Andesha

Andesha commented Jun 16, 2022

Hello community!

It's been a while since I have been involved in these discussions, but I am once again contributing to some BIDS EEG efforts on behalf of EEGNet.

I've taken some time to read over this thread and want to state here as part of my post what I believe to be the current understanding of annotations so that I can update Face13 (and its various example repositories) as necessary once again.

If there's anything wrong with my broad statements of the current efforts, please let me know and hopefully it can restart some discussion here.

  • @robertoostenveld's definition here, of events being things observed at acquisition time versus annotations being data properties identified after recording, is what sets the two apart.
  • The AMICA log likelihood is the only "continuous annotation" being put forward by Face13; it isn't used for segmentation, and only rarely for decision making during review. As such, it is being handled similarly to eye-tracking data and kept as an external zipped TSV.
    • Open to discussion from others if the names of these files need any review.
  • Annotations are currently (in Face13) all kept in one file so as not to overpopulate the derivative folders. In the case of the Lossless pipeline, a dozen or more files could otherwise be created. I believe having one larger file as opposed to numerous small ones serves the standard best.
  • @smakeig raised the point that if these are just additional "flags", why are we using the term "annotations"? I have personally inherited the term from working with @jadesjardins. If what has previously been discussed as "continuous annotations" becomes standardized to be stored much like eye-tracking data, I feel this may be a positive change that broadens understanding of what their use is at first glance.
    • Changing this may require broader community discussion; the name may have stuck in some places...

I'll be coordinating offline with @SaraStephenson and @jadesjardins as necessary for changes to Face13, so no worries. I think I own a few of the commits too...

CCing @jesscall and @christinerogers as well.

Thanks!

@CPernet
Collaborator

CPernet commented Jun 19, 2022

we can raise those points at OHBM2022

  • @jesscall please make the suggestion on the google doc
  • @.all I think annotations, as in continuous annotations via tsv.gz, work great
  • any other 'events' that work with HED should be marked that way rather than creating a new data type

@christinerogers

Some updates to move forward -
First, our team (thanks @jesscall) added a suggestion to the BEP021 spec here - Thanks for integrating or responding.

Second, to get unstuck on annotation nomenclature, @sappelhoff could you confirm / advise on best BIDS practice /other specs? (We've looked at cardio but would appreciate more eyes on this, as previously discussed.)

  • Non-continuous annotations could be renamed flags per @smakeig's suggestion (above)
  • Continuous annotations could be retained as "annotations"

Thanks - and cc Tyler @Andesha Collins, @SaraStephenson

@CPernet to your last bullet above - I can shoot you the latest on events/HED creation next week.

@smakeig

smakeig commented Jul 16, 2022 via email

@dorahermes
Member

Another important point for discussion here is that it will be necessary to have a recommendation for how to indicate a list in a .tsv file, here for the channels. This is not specified in BIDS now. A similar point was raised in the BEP021 spec.

Would there be any issue with a simple comma-separated list? (e.g. O1, Oz, O2)

@CPernet
Collaborator

CPernet commented Aug 5, 2022

'Would there be any issue with a simple comma separated list?' just that BIDS has stuck with tsv :-;
now that I am back in a country that uses commas in numbers I appreciate tsv more ...

@sappelhoff
Member

sappelhoff commented Aug 5, 2022

Can somebody briefly summarize what the issue is with the two terms "annotations" versus "flags", please? From my perspective, we are talking about data and event annotations.


Re: "continuous annotations", I would (as many others already have) advise to follow the specification on physio and stim data. That entails:

  • data is saved in a .tsv.gz file without headers
  • headers and additional metadata are described in an accompanying .json file
  • different kinds of continuous annotations may be distinguished via entities such as recording
  • This last point leads me to the "problem": the terminology ("physio" and "stim" suffixes, "recording" entity) is not very much in line with annotations. We may consider:
    • adding a new suffix to "physio"+"stim", which however would follow the same rules
    • just reusing "physio" or "stim" for our purposes at the expense of the term being confusing
    • same deliberations go for the "recording" entity

Alternative: Not sure if Scott proposed this above, but we could also specify that continuous annotations must be formatted as a data channel and included in the (derivative) neural data file (BrainVision, EDF, ...). In that case all we might have to do is to come up with a good way of how to specify such "channel types" (e.g., see table in this section)

I find this alternative very attractive, because it makes the "name finding" issue easier (see "problem" / 3 "considerations" above)
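If the data-channel route were taken, the derived measure could, for example, appear as an extra row in channels.tsv (tab-separated; the channel name below is made up, and MISC is an existing channel type, although a dedicated type might be preferable):

name	type	units	description
Cz	EEG	uV	recorded EEG channel
amicalogl	MISC	n/a	AMICA log likelihood, time-locked to the EEG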


Re: specifying which particular "channels" a given event pertains to --> there is a discussion thread in the Gdoc draft: https://docs.google.com/document/d/1PmcVs7vg7Th-cGC-UrX8rAhKUHIzOI-uIOh69_mvdlw/edit?disco=AAAAc8E_W-g

Incorporating comments in that thread, my suggestion is to:

  • have two columns: "channels" and "electrodes", because events may pertain to either channels or electrodes
  • all events that happen may pertain to either:
    • a channel in the data and/or
    • an electrode used in the recording (part of the metadata / data)
    • neither
  • For example if an event pertains to a bipolar channel "C3-F3", then that channel must be included in the data
  • the format of specifying channels and electrodes in their respective TSV columns is "pythonic", that is, we have a list of comma-separated, double-quoted strings, see the example below. I formulate it with "channels" in mind, but suggest the same principles for electrodes:
# two channels are affected
["Cz", "Fpz"]

# no channel is affected 
[]

# the notion of channel is not applicable for this data point
"n/a"

# all channels are affected
"all"

# a channel with the name "all" is affected
["all"]

# a channel that contains double quotes in its name is affected
# note the backslash as escape character
["my_weird\"channel_name", "FCz", "C3"]

# channel names MUST NOT contain "tab characters"
# this would break the TSV format / make it uncomfortable

☝️ depending on the event to be annotated, the above format may be used to indicate channels and/or electrodes in their respective columns


Finally, re: non-continuous annotations --> I am fine with both: including them in the standard *_events.tsv or a separate *_desc-annotations_events.tsv. I think it'd be a good idea to not invent a new datatype / suffix for this (we may already need to do that for continuous annotations, see above)

@Andesha

Andesha commented Aug 8, 2022

Thanks for the input @sappelhoff - I'm going to respond in order to your sections...

For my lab, "annotations" vs "flags" is mostly just terminology. We used annotations as a term to encompass all post-processed derivative data properties like bridged channels (computationally discovered), rejected ICs, or ICA breakdown quality. Flags, to me, implies things that are just boolean. I believe on this point we're simply discussing phrasing.

I believe at this point we can lock in the idea of the continuous annotations being in the external files. There appears to be significant agreement. I would however suggest adding a new term, as it will likely age better in the long term and may help inform the process.

As above, we are likely past packaging things into the data as "status" channels or similar. I'm personally fine with the external files.

I am in favour of adding this channel/electrode information as an optional column. It would allow our process to encode whatever "annotation" (or your chosen term) fairly simply, as long as there's no notion of a "simplest form" requirement. I'm not sure, though - how would you envision this extending to ICs, @sappelhoff?

Lastly, I agree. Inventing a new datatype/suffix is not strictly needed.

Some extra thoughts as I was responding:

How does this relate to the provenance (lifecycle?) of a file within BIDS? Does it work naturally, or are there things we should be considering from BEP028? Consider the case of a pipeline doing processing, going back and marking up events, adding reaction-time info, marking study things, etc... Thoughts, @christinerogers?

Thanks!

@smakeig

smakeig commented Aug 8, 2022 via email
