Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 2.11.5.0, 2.11.5.1
- Fix Version/s: 2.12.0
- Component/s: annotation, file format issue, sequencefeatures
- Labels: None
Description
Jalview's sequence features/GFF files can be useful as 'local annotation databases': in earlier versions one could drag/drop a local database of features onto an alignment to annotate the sequences. However, this functionality has degraded in some respects:
- GFF3 import now implicitly results in 'THISISAPLACEHOLDER' sequences for all unresolved features, which at best need to be deleted, and at worst result in millions of additional sequences being created unnecessarily.
- RelaxedIDMatching (JAL-1537 and JAL-753) is still a hidden preference, but these days it no longer seems to cope with some important use cases:
Sequence name in alignment: UNIPROT|H5DT7|PROT_NAME|FOOO
SequenceID in feature: H5DT7
- looking at the code, Jalview should recognise this association, but in the 2.12 branch it currently does not.
--> suggest the ID matcher should create matchings for all words in the name, and then use an ignore list to skip strings that are not expected to be a sequence ID (e.g. a database name, or general English words).
--> opportunity for a semantic/LLM query here? (what are the appropriate IDs for this protein?)
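
A rough sketch of the "all words plus ignore list" suggestion, to make the idea concrete. This is not Jalview's existing matcher code: the class name `RelaxedIdMatcherSketch`, its methods, the delimiter set and the contents of the ignore list are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not part of Jalview: indexes every word of a sequence
// name so that a feature's sequence ID can match any one of them.
final class RelaxedIdMatcherSketch {
    // Assumed ignore list: database names and common words that should never
    // be treated as a sequence ID. A real list would need curating.
    private static final Set<String> IGNORED =
            Set.of("uniprot", "sp", "tr", "pdb", "embl", "the", "and", "protein");

    // Lower-cased word -> names of the alignment sequences containing it.
    private final Map<String, List<String>> wordToNames = new HashMap<>();

    // Split a name on '|', whitespace and a few other separators (assumed),
    // lower-case each word, and drop empty or ignored words.
    static Set<String> words(String name) {
        Set<String> result = new LinkedHashSet<>();
        for (String word : name.split("[|\\s/:;,]+")) {
            String key = word.toLowerCase(Locale.ROOT);
            if (!key.isEmpty() && !IGNORED.contains(key)) {
                result.add(key);
            }
        }
        return result;
    }

    // Register an alignment sequence under every non-ignored word in its name.
    void addSequence(String name) {
        for (String word : words(name)) {
            wordToNames.computeIfAbsent(word, k -> new ArrayList<>()).add(name);
        }
    }

    // Return all alignment sequence names that contain the feature's ID as a word.
    List<String> findMatches(String featureSequenceId) {
        String key = featureSequenceId.toLowerCase(Locale.ROOT);
        return wordToNames.getOrDefault(key, List.of());
    }
}
```

With the example above, registering `UNIPROT|H5DT7|PROT_NAME|FOOO` and looking up `H5DT7` returns that sequence, while looking up `UNIPROT` returns nothing because it is on the ignore list. A word shared by several sequences yields several matches, so a caller would still need a policy for ambiguous IDs.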