























An overview of unix file systems, including the use of directories and subdirectories, and introduces various unix commands for managing files and transferring data between computers. Topics covered include the history of unix file transfer protocol (ftp), file naming conventions, and common file manipulation commands such as rm, mv, cp, and grep.
























BSC4933/5936 Intro’ to BioInfo’ Lab #
Week 1, Tuesday, August 26, 2003
Steve Thompson, BioInfo 4U, 2538 Winnwood Circle, Valdosta, GA, USA 31601- [email protected] 229-249-
GCG® is the Genetics Computer Group, part of Accelrys Inc., a subsidiary of Pharmacopeia Inc., producer of the Wisconsin Package® for sequence analysis.
© 2003 BioInfo 4U
A useful analogy is the file cabinet metaphor: your account is analogous to the entire file cabinet. Your directories are like the drawers of the cabinet, and subdirectories are like hanging folders of files within those drawers. Each hanging folder could have a number of manila folders within it, and so on, down to individual files, all hopefully arranged according to some logical organizational plan.

Computers these days are most often connected to other computers in a network, particularly in an academic or industrial setting. These networks consist of computers and a high-speed combination of copper and fiber optic cabling. Sometimes several computers are networked together into a configuration known as a cluster, where computing power can be spread across the individual members of the cluster (nodes). An extreme example of this is called grid computing, where the nodes may be spread all over the world. Individual computers are most often networked to larger computers called servers as well as to each other. The worldwide system of interconnected, networked computers is called the Internet. Various software programs enable computers to communicate with one another across the Internet. Graphics-based browsers, such as Microsoft Internet Explorer or Netscape Navigator, designed to access the World Wide Web (WWW), one part of the Internet, are an example of this type of program, but only one of several that you will see today.

Computers only do what they have been programmed to do. Their accuracy depends entirely on the software being used, the data being analyzed, and the manner in which the software is used. In scientific biocomputing research, this means that the accuracy and relevance of your results depend on your understanding of the strengths, weaknesses, and intricacies of both the software employed and the biological system being studied.
My Example Protein System

I use members of the same dataset throughout the course’s lab tutorial examples to make them more interesting and to provide a common focused objective. You will be doing the same starting next week with your choice of one of four course ‘project’ molecules. It is somewhat analogous to what you would do in an actual laboratory setting and will provide a basic framework on which you can build. My example molecule is the very well characterized and vitally important protein Elongation Factor 1a.

The Elongation Factors are a vital protein family crucial to protein biosynthesis. They are ubiquitous to all of cellular life and, in concert with the ribosome, they must have been one of the very earliest enzymatic factories to evolve. Three distinct subtypes of elongation factors all work together to help perform the vital function of protein biosynthesis. In [Eu]Bacteria and Eukaryota nuclear genomes they have the following names (the nomenclature in Archaea has not been completely worked out and is often contradictory):

Eukaryota   [Eu]Bacteria   Function
EF-1a       EF-Tu          Binds GTP and an aminoacyl-tRNA; delivers the latter to the A site of ribosomes.
EF-1b       EF-Ts          Interacts with EF-1a/Tu to displace GDP and thus allows the regeneration of GTP-EF-1a/Tu.
EF-2        EF-G           Binds GTP and peptidyl-tRNA and translocates the latter from the A site to the P site.
The Elongation Factor subunit 1-Alpha (EF-1a) in Eukaryota and most Archaea (called Elongation Factor Tu in [Eu]Bacteria [and Euk’ and Arch’ plastids]) has guanine nucleotide, ribosome, and aminoacyl-tRNA binding sites, and is essential to the universal process of protein biosynthesis, promoting the GTP-dependent binding of aminoacyl-tRNA to the A-site of the intact ribosome. The hydrolysis of GTP to GDP mediates a conformational change in a specific region of the molecule. This region is conserved in both EF-1a/Tu and EF-2/G and seems to be typical of GTP-dependent proteins which bind non-initiator tRNAs to the ribosome.

In E. coli EF-Tu is encoded by duplicated loci, tufA and tufB, located about 15 minutes apart on the chromosome at positions 74.92 and 90.02 (ECDC). In humans at least twenty loci on seven different chromosomes demonstrate homology to the gene. However, only two of them are potentially active; the remainder appear to be retropseudogenes (Madsen, et al., 1990). In eukaryotes it is encoded in the nuclear, mitochondrial, and chloroplast genomes, and it is a globular, cytoplasmic enzyme in all life forms.

The three-dimensional structure of Elongation Factor 1a/Tu has been solved in more than fifteen cases. Partial and complete E. coli structures have been resolved and deposited in the Protein Data Bank (1EFM, 1ETU, 1DG1, 1EFU, and 1EFC), the complete Thermus aquaticus and Thermus thermophilus structures have been determined (1TTT, 1EFT, and 1AIP), and even cow EF-1a has had its structure determined (1D2E). Most of the structures show the protein in complex with its nucleotide ligand; some show the ternary complex. The Thermus aquaticus structure is shown below as drawn by NCBI’s Cn3D molecular visualization tool:

[Figure: Thermus aquaticus EF-Tu: 1EFT]
and secure copy. It is also included in all modern UNIX OSs, but not in pre-OS X Macs or in MS Windows. An implementation of these programs is also available on all the Conradi and CSIT computers.

Furthermore, since ssh is strictly a non-graphical terminal program, and since all Web browsers’ graphics capability is inadequate for the truly interactive graphics that much biocomputing software requires, another type of graphical system needs to be present on the computer that you use for this course. That graphical interface is called the X Window System. It was developed at MIT (the Massachusetts Institute of Technology) in the 1980’s, back in the early days of UNIX, as a distributed, hardware-independent way of exchanging graphical information between different UNIX computers.

Unfortunately the X worldview is a bit backwards from the standard client/server computing model. In the standard model a local client, for instance a Web browser, displays information from a remote server, for instance a particular WWW site, also called a Uniform Resource Locator (URL). In the world of X, an X-server program on the machine that you are sitting at (the local machine) displays the graphics from an X-client program that could be located either on your own machine or on a remote server machine that you are connected to. Confused yet?

Nearly all UNIX computers, including Linux, but not including Mac OS X, include a genuine X Window System in their default configuration. MS Windows computers, including the ones in the Conradi Lab, are often loaded with X-server emulation software, such as the commercial program XWin32 or eXceed, to provide X-server functionality. Macintosh computers prior to OS X required a commercial X solution; often the program MacX or eXodus was used. However, since OS X Macs are true UNIX machines, they can use a variety of free public domain packages to provide true X Windowing. This is being done on the Conradi Lab Macs.
Florida State University’s main biocomputing server for sequence analysis is a Dell PowerEdge 6650 named Mendel, bought with Howard Hughes Medical Institute monies from the recently successful Biology Department undergraduate education proposal. Mendel has four 1.6 GHz Intel Xeon CPUs, eight GB of RAM, and over 700 GB of storage. This machine (mendel.csit.fsu.edu) is managed by, and located in, the School of Computational Science and Information Technology (CSIT), and runs RedHat Linux version 7.2. RedHat Linux is a commercial distribution of the free, UNIX-derived, Open Source Linux OS. Linux was invented in the early 1990’s by a student at the University of Helsinki in Finland named Linus Torvalds as a ‘hobby.’

Mendel only allows ssh, scp, and sftp connections. In order to display X Windows on your local computer you will need to allow ssh X tunneling. In this course you will learn what this means and how to use much of the biocomputing software installed on Mendel. You have been issued an account on this machine by merit of your enrollment in this course. Other local servers that we will connect to today include the URLs http://bio.fsu.edu/ and http://www.csit.fsu.edu/. Becoming familiar with these three machines is the main objective of today’s lab.

Week One Tutorial: Exploring the FSU Biocomputing Environment

WWW Browsers and other Local Programs
So how do you do bioinformatics? Often bioinformatics is done on the Internet through the WWW. This is possible and easy and fun, but, besides making it a bit too easy to get sidetracked, the Web cannot readily handle large datasets or large multiple sequence alignments. These sorts of datasets quickly become intractable in the Web environment. You’ll know you’re there when you try. In spite of that, let’s begin with a Web resource designed specifically for this course.

Activate the computer that you are sitting at by moving its mouse or by pressing the return key on its keyboard. You may then have to log onto it with an appropriate user ID and password. This is not your Mendel account; it is whatever account gives you access to that lab’s computers. If you don’t know this information, contact a lab instructor as soon as possible. Now launch a Web browser by selecting the appropriate icon. If the icon is on the desktop, a <double-click> will launch it; if it’s in a Mac, MS Windows, or UNIX menu, a single
password. The “ -X ” (note that the -X is capitalized!) option is necessary to allow ‘X tunneling’ and set up your X environment. This is the only encrypted, secure way to make X connections, and is required by Mendel if you want to use any resources that require X windows. If you are using a GUI version of ssh, then the X tunneling option should be turned on by default. Further details of X on Mendel will not be fully covered in this tutorial. There are too many variables depending on your local machine; we’ll just go over the key concepts. If this isn’t enough, ask me for further assistance. I’m also available for personal help in your own laboratories. If you are having trouble using Mendel from there, just contact me at [email protected].

Regardless of the ssh method used to launch and connect to Mendel, you should now have an interactive command line terminal session running on Mendel in a separate window on your local machine’s desktop. Mendel’s OS checked your username and password, and if correct, ran your default shell program and any startup scripts and then returned the system prompt. The shell program is your interface to the UNIX OS. It interprets and executes the commands that you type. Common UNIX shells include bash, Korn, the C shell, and a popular C shell derivative that Mendel users run by default called tcsh. Tcsh, like bash, enables command history recall using the keyboard arrow keys, accepts tab word completion, and allows command line editing.

Upon logging in, you end up in your ‘home directory,’ that portion of Mendel’s hard drive disk space reserved just for you and designated by you from anywhere on the system with the character string “$HOME”. This is called an “environment variable.” You should see a screen trace similar to the following upon logging in:

Welcome to the WISCONSIN PACKAGE
Version 10.3-UNIX
Installed on linux
Copyright (c) 1982 - 2001, Accelrys Inc. A wholly owned subsidiary of Pharmacopeia, Inc. All rights reserved.
Published research assisted by this software should cite:
Wisconsin Package Version 10.3, Accelrys Inc., San Diego, CA
Databases available:
GenBank/GenBank Tags Release 135.0 (4/2003)
GenPept translations Release 135.0 (4/2003)
Genome (H. sapiens) Build 33.0 (4/2003)
NRL_3D PDB sequences Release 28.0 (9/2000)
PIR-Protein Release 76.0 (3/2003)
SWISS-PROT Release 41.5 (4/2003)
SP-TREMBL Release 23.7 (4/2003)
PROSITE Release 17.2 (7/2002)
Restriction Enzymes (REBASE) 10.0 (10/2000)
Technical support see: http://www.accelrys.com/support/
Online help: % genhelp or genmanual or http://www.csit.fsu.edu/gcg/ with off-campus restricted access: user - gcg, password - stevet
thompson@mendel >

The screen trace shows the version numbers for the Wisconsin Package (much more on this topic later today) and of all its online databases on Mendel. The system prompt displays the user and the machine name and waits to receive a command. On different UNIX systems the prompt may appear different ways depending on
how the system administrator has set up the user environment. Here I will just use the ‘greater than’ sign (>) to represent the system prompt. It should not be typed as part of the command. In command line mode each command is terminated by the ‘return’ or ‘enter’ key.

UNIX uses the ASCII character set and, unlike some OSs, it supports both upper and lower case. A disadvantage of using both upper and lower case is that commands and file names must be typed in the correct case. Most UNIX commands and file names are in lower case. Commands and file names should not include spaces or any punctuation other than periods (.), hyphens (-), or underscores (_). UNIX command options are specified by a required space and the hyphen character ( - ).

UNIX does not use or directly support function keys. Special functions are generally invoked using the ‘Control’ key. For example, a running command can be aborted by pressing the “Control” key [sometimes labeled “CTRL” or denoted with the caret symbol (^)] and the letter “c.” The short form for this is generally written CTRL-C or ^C. Control key combinations can be harder to remember than dedicated function keys, but nearly every terminal program supports the control key, allowing UNIX to be used from a wide variety of different platforms that might connect to the server.

The general command syntax for UNIX is a command followed by some options, and then some parameters. If a command reads input, the default input for the command will generally come from the interactive terminal. The output from a system level command (if any) will generally be printed to your terminal. The command syntax allows the input and output of a program to be redirected to or from a file, or the output of one program to be passed as the input to another program.
General command syntax follows:

    cmd
    cmd -options
    cmd -options parameters

To cause a command to read from a file rather than from the terminal, the “<” sign is used on the command line, and the “>” sign causes the program to write its output to a file (for those programs that do not do this by default):

    cmd -options parameters < input
    cmd -options parameters > output
    cmd -options parameters < input > output

To cause the output from one program to be passed to another program as input, a vertical bar (|), known as the “pipe,” is used:

    cmd1 -options parameters | cmd2

This feature is called “piping” the output of one program into the input of another.
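The redirection and pipe syntax above can be tried with a short sketch like the following. It is only an illustration: the scratch directory, the file names input.txt and output.txt, and the three lines of sample text are all made up for this example and are not part of the course materials.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing in your account is touched.
# (scratch, input.txt and output.txt are hypothetical names for this sketch.)
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Create a small three-line input file.
printf 'tufA\ntufB\nEF-Tu\n' > input.txt

# "<" makes sort read from input.txt; ">" sends its output to output.txt.
sort < input.txt > output.txt

# "|" pipes grep's output into wc, which counts the matching lines;
# tr strips the padding that some versions of wc put around the number.
matches=$(grep tuf input.txt | wc -l | tr -d ' ')

echo "lines containing tuf: $matches"
cat output.txt
```

The same three symbols work with nearly any command, so it is worth rehearsing them on small files like these before pointing them at real data.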
thompson@mendel > passwd
Changing password for thompson.
Old password:
New password:
Retype new password:

Write down your new password alongside your Mendel account name. Don’t forget!

When an account is created, your home directory environment variable, “$HOME,” is created and associated with that account. In any tree structured file system the concept of where you are in the tree is very important. There are two ways of specifying where things are. You can refer to something relative to your current directory or by its complete ‘path’ name. When the complete path name is given by beginning the specification with a slash, the current position in the directory tree is ignored. To find the complete path in Mendel’s file system to your home directory type the command “ pwd ”:

thompson@mendel > pwd
/home/thompson

This UNIX command shows you where you are presently located on the server. It displays the complete UNIX path specification (this always starts with a slash) for the directory structure of your account. Also notice that UNIX uses forward slashes (/) to differentiate between subdirectories, not backward slashes (\) like MS-DOS. The pwd command can be used at any point to keep track of your location. Several commands for working with your directory structure follow:

pwd     print working directory
ls      list the contents of the directory
mkdir   make a new directory
cd      change directory

To list the files in your home directory, use the “ ls ” command. There are many options to the ls command. Check them out by typing “ man ls ”. The most useful options are the “ -l ” and “ -a ” options. The command “ ls -l ” will list the files and directories in your current directory in the ‘long’ form with extended information. A UNIX convention is that files with a period as the first character in their name are not listed by the ls command unless the “ -a ” ‘all’ option is given.
This convention has led to a number of special configuration files having periods as the first character in their name. Some of these files are executed automatically when a user logs in, much like “AUTOEXEC.BAT” and “CONFIG.SYS” are executed in MS-DOS at startup. On many UNIX systems there is a file executed upon every login called “.login” and another one that sets up the shell environment called “.cshrc”. In general you do not want to mess with these files in your account until you are very comfortable with the OS. Following are three examples of the ls command in my account:

thompson@mendel > ls
bin      gcg        mail     patterns    seqlab      temp.epsf  tutorials
db_info  login.bak  molevol  ribo_files  snap_files  temp.ps    working

thompson@mendel > ls -l
total 80
drwxr-xr-x   3 thompson gcg   4096 Feb 22  2002 bin
drwxr-xr-x   2 thompson gcg   4096 Jan 16  2001 db_info
drwxr-xr-x   2 thompson gcg   4096 Dec 11 18:05 gcg
-rwxr-xr-x   1 thompson gcg   1797 Jun  8  1998 login.bak
drwx------   2 thompson gcg   4096 Mar  8  2002 mail
drwxr-xr-x   9 thompson gcg   4096 Aug 16 09:43 molevol
drwxr-xr-x   4 thompson gcg   4096 Jun  3  1999 patterns
drwxr-xr-x  15 thompson gcg   4096 Oct 16  2001 ribo_files
drwxrwxr-x   2 thompson gcg   4096 Nov 14 10:34 seqlab
drwxr-xr-x   5 thompson gcg   4096 Oct 16  2001 snap_files
-rw-r--r--   1 thompson gcg  21798 Nov 14 11:42 temp.epsf
-rw-r--r--   1 thompson gcg   5724 Nov 13 20:52 temp.ps
drwxr-xr-x   6 thompson gcg   4096 Apr 30  2002 tutorials
drwxr-xr-x  12 thompson gcg   4096 Apr  8  2002 working

thompson@mendel > ls -a
.              .forward           molevol       .seqlab-history  temp.ps
..             gcg                .netscape     .seqlab-mendel   tutorials
.bash_history  .gcgmydevices      patterns      .sh_history      working
bin            .gcgmydevices.old  .pauphistory  snap_files       .Xauthority
.cshrc.OFF     login.bak          .pinerc       .ssh
db_info        .login.OFF         .profile.OFF  .ssh
.dt            .login.ORIGINAL    ribo_files    .sysman
.dtprofile     mail               seqlab        temp.epsf

In the output from “ ls -l ” additional information regarding the file permissions, owner of the file, size, modification date, and file name is shown. In the output from “ ls -a ” those ‘dot’ system files are now seen. Nearly all OSs have some way to customize your login environment with editable configuration files; on UNIX these dot files serve that purpose. The experienced user can place commands in these special files to customize their individual login environment.

Subdirectories are generally used to group files associated with one particular project or files of a particular type. For example, you might store all of your memorandums in a directory called “memo.” The “ mkdir ” command is used to create directories and the “ cd ” command is used to move into directories. The special placeholder file “ .. ” allows you to move back up the directory tree. Note its use below with the cd command to go back up to the parent of the current directory:

thompson@mendel > mkdir memo
thompson@mendel > ls
bin      gcg        mail  molevol   ribo_files  snap_files  temp.ps    working
db_info  login.bak  memo  patterns  seqlab      temp.epsf   tutorials
thompson@mendel > cd memo
thompson@mendel > pwd
/home/thompson/memo
thompson@mendel > cd ..
thompson@mendel > pwd
/home/thompson
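The difference between relative and complete path names, and the effect of the “ -a ” option on ‘dot’ files, can be checked with a small sketch along these lines. It assumes a throwaway directory standing in for a home directory; the names top, notes, and .hidden are invented for the illustration, and only “memo” comes from the example above.

```shell
#!/bin/sh
# Use a throwaway directory as a stand-in for a home directory.
# (top, notes and .hidden are hypothetical names for this sketch.)
top=$(mktemp -d)
cd "$top" || exit 1

mkdir memo             # relative path: memo is created in the current directory
touch .hidden notes    # one 'dot' file and one ordinary file

plain=$(ls)            # ordinary listing: .hidden is left out
all=$(ls -a)           # -a listing: ., .. and .hidden are included

cd memo                # relative move down into memo
inside=$(pwd)          # complete path of the memo directory
cd ..                  # ".." moves back up to the parent directory
cd "$top/memo"         # complete path: works from anywhere in the tree

echo "plain listing:"; echo "$plain"
echo "listing with -a:"; echo "$all"
echo "memo is at: $inside"
```

Because the complete path starts with a slash, the final cd command would succeed no matter which directory it was typed from, whereas “ cd memo ” only works from the directory that contains memo.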
Commands for looking at the system, other users, your login sessions, jobs you are running, and command execution:

uptime      Shows the time since the system was last rebooted. Also shows the “load average”. Load average indicates the number of jobs in the system ready to run. The higher the load average the slower the system will run.
w or who    Shows who is logged in to the system doing what.
top         Shows the most active processes on the entire machine and the portion of CPU cycles assigned to running processes. Press “ q ” to quit.
ps          Shows your current processes and their status (running, sleeping, idle, terminated, etc.). See “man ps,” since options vary widely between systems; note especially the a, e, l, f, u, and U options.
at          Submits a script to the at queue for execution later.
bg          Resumes a suspended job in background mode.
fg          Brings a background job back into interactive mode.

The following commands affect the file system and access files. The basic file commands:

cat tmp.txt          Shows the contents of the file “ tmp.txt ” on your screen; also concatenates files, for example: “cat file1 file2 > file3.”
more tmp.txt         Shows the contents of the file “ tmp.txt ” at the terminal one page at a time; press the space bar to continue. Type a “ ? ” when the scrolling stops for viewing options. (The less command is often available as well; it is more powerful than more.)
head tmp.txt         Shows the first few lines of the file “ tmp.txt .”
tail tmp.txt         Shows the last few lines of the file “ tmp.txt .”
grep xterm tmp.txt   Shows the lines in “ tmp.txt ” that contain the specified pattern, here the word “ xterm .”
wc tmp.txt           Counts the number of characters, words, and lines in the file “ tmp.txt .”
cp tmp.txt tmp       Copies the file “ tmp.txt ” to the file “ tmp .” Any previous contents of the file “ tmp ” are lost.
mv tmp.txt tmp       Renames (moves) the file “ tmp.txt ” to the file “ tmp .” Any previous contents of the file “ tmp ” are lost.
mv tmp memo          Since “ memo ” is a directory name, not a file name, this command moves the specified file, “ tmp ,” into the specified directory, “ memo ,” keeping the original file name intact.
rm memo/tmp          Deletes (removes) the file “ tmp ” in the directory “ memo .” It is unrecoverable!
chmod perm           Changes the permissions of a file. See “man chmod” and also “man chown” for further details.
lpr file             Prints the specified file on the default system printer. You will need to specify a particular print queue with the “-P” option to send it elsewhere.

Directory commands:

pwd            Print Working Directory. Shows you where you are in the file system. Very useful when you get confused. (Also see “whoami” if really confused!)
ls             Shows (lists) your files’ names.
ls -l          Shows your files’ names in extended (long) format including file size, ownership, and permissions.
ls -al         Shows all files including the system files (.files) in your directory in the long format.
mkdir newdir   Makes a new directory in your current directory.
rmdir newdir   Removes a subdirectory from your current directory. The directory must be empty before it can be removed.
rm -r dir      Removes all the files and subdirectories of a directory and then removes the directory. Very convenient, useful, and dangerous.
cd             Moves back into your home directory from anywhere.
cd memo        Moves down into a directory named “ memo ” from your current directory.

Usually it is best to leave programs using the quit or exit commands; however, occasionally it is necessary to terminate a running program. Here are some useful commands for doing this.

Commands for bailing out of programs:
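The basic file and directory commands listed earlier can be rehearsed safely with a sketch like the one below. It runs in a throwaway directory rather than your account; the three lines of sample text are made up, while the names tmp.txt, tmp, and memo simply follow the command list.

```shell
#!/bin/sh
# Rehearse the file commands in a throwaway directory.
# (work is a hypothetical name; the sample text is invented for this sketch.)
work=$(mktemp -d)
cd "$work" || exit 1

# Build a small three-line file named tmp.txt.
printf 'xterm one\nplain two\nxterm three\n' > tmp.txt

first=$(head -n 1 tmp.txt)      # first line only
last=$(tail -n 1 tmp.txt)       # last line only
hits=$(grep -c xterm tmp.txt)   # -c counts the lines containing "xterm"

cp tmp.txt tmp     # copy; tmp.txt itself is left in place
mkdir memo
mv tmp memo        # memo is a directory, so tmp is moved into it
rm memo/tmp        # delete the copy; this cannot be undone
rmdir memo         # succeeds only because memo is now empty

left=$(ls)         # what remains in the directory
echo "first: $first"
echo "last: $last"
echo "xterm lines: $hits"
echo "remaining: $left"
```

Trying rm and rmdir on disposable files first is a reasonable habit, since neither command asks for confirmation by default.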
biotech endeavors of the last several years. You saw their introductory welcome screen when you first logged on to Mendel. That process also initialized the Package’s user environment.

The Genetics Computer Group

The Wisconsin Package for Sequence Analysis began as a service project in 1982 in Oliver Smithies’ lab in the Genetics Department at the University of Wisconsin, Madison. That effort was spun off to a University Research Park location, becoming an independent private company, the Genetics Computer Group (GCG), in
Data Environment: GDE” (1994). Many people were very impressed and he made it freely available. Coincidentally GCG realized the need for some sort of a ‘point-and-click’ environment for their system. They were losing lots of business, only being able to provide a command line interface. Therefore, they started trying to develop a GUI for the Wisconsin Package and released it in 1994. They called it the “Wisconsin Package Interface,” WPI for short. Few were impressed. It only provided a menu to their programs, hardly anything more than the “-check” command line option they’ve always had. So they did a natural and very smart thing. They hired Steve Smith away from Millipore, where he had recently moved, into their company, so that he could merge his GDE with their WPI. The late 1996 offspring was SeqLab, and, thank goodness, they threw away the acronyms (GDE + WPI = SeqLab). As ‘they’ say, “The rest is history,” and once more GCG’s customers are (generally) happy. Even though it’s not perfect, once you gain an appreciation for SeqLab’s power and ease of use, I don’t think you’ll be satisfied with any other sequence analysis system. I know I’m not.

Using the Wisconsin Package: Specifying Sequences and Logical Terms!

Before launching into SeqLab, let’s go over an important, central ‘idea’ of the Wisconsin Package. One of the most difficult aspects of the Package for new users to get used to is how to tell the programs what sequences you want to work with. GCG calls this “specifying sequences” and it’s crucial to understanding the way their programs work. Once you’ve become comfortable with these concepts, many of the frustrations commonly encountered with the Package will disappear. So, to answer the always perplexing GCG question “What sequence(s)?” here are the four ways of specifying sequences, in order of increasing power and complexity:
1. The sequence is in a local GCG format single sequence file in your UNIX account. This sequence file can be anywhere in your account as long as you supply an appropriate path so that the program can find the file. The sequence file can have any name but it is best to use extensions that tell you what type of molecule it is, e.g. “.seq” and “.pep” (e.g. “my.pep” or “~/subdir/my.seq”). Use the program reformat to convert ‘raw’ text format files to GCG format. Several GCG From and To programs are also available for specific data format conversions, and SeqLab’s Editor Mode can directly “Import” native GenBank and ABI style trace format files without the need to reformat.

This is a small example of 'raw' GCG single sequence format. Always put some documentation on top, so in the future you can figure out what it is you're dealing with! Two periods always separate that documentation from the actual data. ..
ACTGACGTCACATACTGGGACTGAGATTTACCGAGTTATACAAGTATACAGATTTAATAGCATGCGATCCCATGGGA

Next, the clean GCG format single sequence file after reformat:

This is a small example of GCG single sequence format. Always put some documentation on top, so in the future you can figure out what it is you're dealing with! The line with the two periods is converted to the checksum line.